In the previous page, we established the conceptual motivation for Kernel PCA: standard PCA fails on nonlinearly structured data, and kernel methods offer a path to performing PCA in a high-dimensional feature space without explicitly computing feature vectors. Now we make this rigorous.
The kernel trick is not merely a computational convenience—it is a profound mathematical insight that transforms intractable problems into elegant, solvable ones. By reformulating PCA to depend only on inner products between data points, we unlock the ability to work in infinite-dimensional spaces while performing only finite-dimensional linear algebra.
This page derives the kernel trick for PCA from first principles. We begin with standard PCA, reformulate it in terms of inner products, introduce the kernel function, and arrive at the complete Kernel PCA algorithm. Along the way, we'll see why certain mathematical structures (the dual representation, the Gram matrix, reproducing kernel Hilbert spaces) are not mere technicalities but essential to the algorithm's correctness and efficiency.
By the end of this page, you will be able to derive Kernel PCA from first principles, understand why the dual formulation is mathematically necessary, prove that eigenvectors of the kernel matrix correspond to principal components in feature space, and implement the complete algorithm including proper normalization.
Before kernelizing PCA, let's establish the standard formulation precisely. Consider a dataset of $n$ samples $\{\mathbf{x}_1, \ldots, \mathbf{x}_n\}$ where each $\mathbf{x}_i \in \mathbb{R}^d$.
Step 1: Center the Data
Compute the mean: $\bar{\mathbf{x}} = \frac{1}{n}\sum_{i=1}^{n}\mathbf{x}_i$
Center each point: $\tilde{\mathbf{x}}_i = \mathbf{x}_i - \bar{\mathbf{x}}$
Centering is essential—PCA finds directions of maximum variance from the mean, not from the origin.
Step 2: Compute the Covariance Matrix
The sample covariance matrix is: $$\mathbf{C} = \frac{1}{n}\sum_{i=1}^{n}\tilde{\mathbf{x}}_i\tilde{\mathbf{x}}_i^T = \frac{1}{n}\tilde{\mathbf{X}}^T\tilde{\mathbf{X}}$$
where $\tilde{\mathbf{X}} \in \mathbb{R}^{n \times d}$ is the centered data matrix with rows $\tilde{\mathbf{x}}_i^T$.
The covariance matrix $\mathbf{C} \in \mathbb{R}^{d \times d}$ is symmetric and positive semi-definite.
Step 3: Eigendecomposition
Solve the eigenvalue problem: $$\mathbf{C}\mathbf{v} = \lambda\mathbf{v}$$
This yields $d$ eigenvalues $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_d \geq 0$ and corresponding orthonormal eigenvectors $\mathbf{v}_1, \ldots, \mathbf{v}_d$.
Step 4: Project Data
Project onto the top $k$ principal components: $$z_j^{(i)} = \tilde{\mathbf{x}}_i^T\mathbf{v}_j = \langle \tilde{\mathbf{x}}_i, \mathbf{v}_j \rangle$$
for $j = 1, \ldots, k$. The reduced representation is $\mathbf{z}_i = (z_1^{(i)}, \ldots, z_k^{(i)})^T \in \mathbb{R}^k$.
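Before kernelizing these four steps, it may help to see them as code. The following is a minimal NumPy sketch (the function name `standard_pca` and the use of `np.linalg.eigh` are illustrative choices, not prescribed by the text):

```python
import numpy as np

def standard_pca(X: np.ndarray, k: int):
    """Minimal standard PCA: center, covariance, eigendecomposition, projection."""
    n = X.shape[0]
    X_centered = X - X.mean(axis=0)          # Step 1: center the data
    C = (X_centered.T @ X_centered) / n      # Step 2: covariance matrix (d x d)
    eigvals, eigvecs = np.linalg.eigh(C)     # Step 3: eigendecomposition (ascending order)
    order = np.argsort(eigvals)[::-1]        # sort eigenvalues descending
    V = eigvecs[:, order[:k]]                # top-k orthonormal eigenvectors
    Z = X_centered @ V                       # Step 4: project onto principal components
    return Z, eigvals[order[:k]], V

# Example
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
Z, lambdas, V = standard_pca(X, k=2)
print(Z.shape, lambdas)
```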
PCA can be derived from two equivalent perspectives: (1) Maximum Variance: Find directions maximizing projected variance, (2) Minimum Reconstruction Error: Find directions minimizing squared reconstruction error. Both lead to the same eigenvector solution. For kernelization, the key observation is that both formulations can be expressed using only inner products $\langle \mathbf{x}_i, \mathbf{x}_j \rangle$.
Now consider a nonlinear mapping $\phi: \mathbb{R}^d \rightarrow \mathcal{F}$ where $\mathcal{F}$ is a (potentially high- or infinite-dimensional) feature space. We want to perform PCA on the mapped data $\{\phi(\mathbf{x}_1), \ldots, \phi(\mathbf{x}_n)\}$.
Step 1: Center in Feature Space
Mean in feature space: $\bar{\phi} = \frac{1}{n}\sum_{i=1}^{n}\phi(\mathbf{x}_i)$
Centered data: $\tilde{\phi}(\mathbf{x}_i) = \phi(\mathbf{x}_i) - \bar{\phi}$
Step 2: Covariance in Feature Space
The covariance matrix in feature space is: $$\mathbf{C}_{\mathcal{F}} = \frac{1}{n}\sum_{i=1}^{n}\tilde{\phi}(\mathbf{x}_i)\tilde{\phi}(\mathbf{x}_i)^T$$
This is an operator on $\mathcal{F}$. If $\mathcal{F}$ has dimension $D$ (possibly infinite), $\mathbf{C}_{\mathcal{F}}$ is a $D \times D$ matrix.
Step 3: Eigenvalue Problem in Feature Space
We seek eigenvectors $\mathbf{v} \in \mathcal{F}$ satisfying: $$\mathbf{C}_{\mathcal{F}}\mathbf{v} = \lambda\mathbf{v}$$
If $\mathcal{F}$ is infinite-dimensional (as with the RBF kernel), we cannot explicitly form $\mathbf{C}_{\mathcal{F}}$ or store its eigenvectors. Even for finite but high-dimensional feature spaces, explicit computation becomes intractable. We need a fundamentally different approach that never requires explicit coordinates in $\mathcal{F}$.
The Key Observation: Eigenvector Representation
Here is the crucial insight that enables the kernel trick. Consider the eigenvalue equation: $$\mathbf{C}_{\mathcal{F}}\mathbf{v} = \lambda\mathbf{v}$$
Substituting the definition of $\mathbf{C}_{\mathcal{F}}$: $$\frac{1}{n}\sum_{i=1}^{n}\tilde{\phi}(\mathbf{x}_i)\tilde{\phi}(\mathbf{x}_i)^T\mathbf{v} = \lambda\mathbf{v}$$
$$\frac{1}{n}\sum_{i=1}^{n}\tilde{\phi}(\mathbf{x}_i)\langle \tilde{\phi}(\mathbf{x}_i), \mathbf{v} \rangle = \lambda\mathbf{v}$$
The term $\langle \tilde{\phi}(\mathbf{x}_i), \mathbf{v} \rangle$ is a scalar. So the eigenvector $\mathbf{v}$ is a linear combination of $\{\tilde{\phi}(\mathbf{x}_i)\}$: $$\mathbf{v} = \sum_{i=1}^{n}\alpha_i \tilde{\phi}(\mathbf{x}_i)$$
for some coefficients $\alpha_1, \ldots, \alpha_n$.
Why This Matters
This representation theorem tells us that even though $\mathbf{v}$ lives in a potentially infinite-dimensional space, it can be fully characterized by $n$ real numbers $(\alpha_1, \ldots, \alpha_n)$. We've reduced an infinite-dimensional problem to a finite-dimensional one!
Now we derive the eigenvalue problem in terms of the coefficient vector $\boldsymbol{\alpha} = (\alpha_1, \ldots, \alpha_n)^T$.
Substituting the Representation
We have the eigenvalue equation $\mathbf{C}_{\mathcal{F}}\mathbf{v} = \lambda\mathbf{v}$ together with the representation $\mathbf{v} = \sum_{j=1}^{n}\alpha_j \tilde{\phi}(\mathbf{x}_j)$.
Take the inner product of both sides of the eigenvalue equation with $\tilde{\phi}(\mathbf{x}_\ell)$ for each $\ell \in \{1, \ldots, n\}$:
$$\langle \tilde{\phi}(\mathbf{x}_\ell), \mathbf{C}_{\mathcal{F}}\mathbf{v} \rangle = \lambda \langle \tilde{\phi}(\mathbf{x}_\ell), \mathbf{v} \rangle$$
Left Side Expansion
$$\langle \tilde{\phi}(\mathbf{x}_\ell), \mathbf{C}_{\mathcal{F}}\mathbf{v} \rangle = \left\langle \tilde{\phi}(\mathbf{x}_\ell), \frac{1}{n}\sum_{i=1}^{n}\tilde{\phi}(\mathbf{x}_i)\langle \tilde{\phi}(\mathbf{x}_i), \mathbf{v} \rangle \right\rangle$$
$$= \frac{1}{n}\sum_{i=1}^{n}\langle \tilde{\phi}(\mathbf{x}_\ell), \tilde{\phi}(\mathbf{x}_i) \rangle \langle \tilde{\phi}(\mathbf{x}_i), \mathbf{v} \rangle$$
Now substitute $\mathbf{v} = \sum_{j}\alpha_j \tilde{\phi}(\mathbf{x}_j)$:
$$= \frac{1}{n}\sum_{i=1}^{n}\langle \tilde{\phi}(\mathbf{x}_\ell), \tilde{\phi}(\mathbf{x}_i) \rangle \sum_{j=1}^{n}\alpha_j\langle \tilde{\phi}(\mathbf{x}_i), \tilde{\phi}(\mathbf{x}_j) \rangle$$
Notice that every term involves inner products of the form $\langle \tilde{\phi}(\mathbf{x}_i), \tilde{\phi}(\mathbf{x}_j) \rangle$. If we define the centered kernel matrix $\tilde{K}_{ij} = \langle \tilde{\phi}(\mathbf{x}_i), \tilde{\phi}(\mathbf{x}_j) \rangle$, all computations reduce to matrix operations on $\tilde{\mathbf{K}}$.
Defining the Centered Kernel Matrix
Let $\tilde{K}_{ij} = \langle \tilde{\phi}(\mathbf{x}_i), \tilde{\phi}(\mathbf{x}_j) \rangle$. The left side becomes: $$\frac{1}{n}\sum_{i=1}^{n}\tilde{K}_{\ell i}\sum_{j=1}^{n}\alpha_j\tilde{K}_{ij} = \frac{1}{n}\sum_{i=1}^{n}\tilde{K}_{\ell i}(\tilde{\mathbf{K}}\boldsymbol{\alpha})_i = \frac{1}{n}(\tilde{\mathbf{K}}^2\boldsymbol{\alpha})_\ell$$
Right Side Expansion
$$\lambda \langle \tilde{\phi}(\mathbf{x}_\ell), \mathbf{v} \rangle = \lambda \sum_{j=1}^{n}\alpha_j\langle \tilde{\phi}(\mathbf{x}_\ell), \tilde{\phi}(\mathbf{x}_j) \rangle = \lambda \sum_{j=1}^{n}\alpha_j\tilde{K}_{\ell j} = \lambda(\tilde{\mathbf{K}}\boldsymbol{\alpha})_\ell$$
The Matrix Equation
Since the above holds for all $\ell = 1, \ldots, n$, we have: $$\frac{1}{n}\tilde{\mathbf{K}}^2\boldsymbol{\alpha} = \lambda\tilde{\mathbf{K}}\boldsymbol{\alpha}$$
Multiplying both sides by $\tilde{\mathbf{K}}^{-1}$ (or more carefully, noting this holds for eigenvectors with nonzero eigenvalue): $$\tilde{\mathbf{K}}\boldsymbol{\alpha} = n\lambda\boldsymbol{\alpha}$$
Defining $\mu = n\lambda$, we have the dual eigenvalue problem: $$\tilde{\mathbf{K}}\boldsymbol{\alpha} = \mu\boldsymbol{\alpha}$$
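To make the dual problem concrete, here is a small numerical check (my own illustration, assuming a linear kernel so that both sides can be computed explicitly): the eigenvalues of the centered Gram matrix are $n$ times the covariance eigenvalues, and $\mathbf{v} = \sum_i \alpha_i \tilde{\phi}(\mathbf{x}_i)$ is parallel to the corresponding primal eigenvector.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
Xc = X - X.mean(axis=0)          # centered data; phi is the identity for a linear kernel
n = Xc.shape[0]

# Primal: eigenpairs of the covariance matrix C = (1/n) Xc^T Xc
lam, V = np.linalg.eigh(Xc.T @ Xc / n)
lam, V = lam[::-1], V[:, ::-1]   # sort descending

# Dual: eigenpairs of the centered Gram (kernel) matrix K_tilde = Xc Xc^T
mu, A = np.linalg.eigh(Xc @ Xc.T)
mu, A = mu[::-1], A[:, ::-1]

print(np.allclose(mu[:4], n * lam))          # mu = n * lambda on the nonzero spectrum

# v = sum_i alpha_i * x_i points along the primal eigenvector (up to sign)
v1 = Xc.T @ A[:, 0]
v1 /= np.linalg.norm(v1)
print(np.allclose(np.abs(v1), np.abs(V[:, 0])))
```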
The derivation above used the centered kernel matrix $\tilde{K}_{ij} = \langle \tilde{\phi}(\mathbf{x}_i), \tilde{\phi}(\mathbf{x}_j) \rangle$ where $\tilde{\phi}(\mathbf{x}) = \phi(\mathbf{x}) - \bar{\phi}$. But we only have access to the uncentered kernel: $$K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j) = \langle \phi(\mathbf{x}_i), \phi(\mathbf{x}_j) \rangle$$
How do we compute the centered kernel matrix without explicit feature vectors?
Deriving the Centering Formula
Recall: $$\tilde{\phi}(\mathbf{x}_i) = \phi(\mathbf{x}_i) - \frac{1}{n}\sum_{m=1}^{n}\phi(\mathbf{x}_m)$$
Therefore: $$\tilde{K}_{ij} = \langle \tilde{\phi}(\mathbf{x}_i), \tilde{\phi}(\mathbf{x}_j) \rangle$$
$$= \left\langle \phi(\mathbf{x}_i) - \frac{1}{n}\sum_m\phi(\mathbf{x}_m),\ \phi(\mathbf{x}_j) - \frac{1}{n}\sum_\ell\phi(\mathbf{x}_\ell) \right\rangle$$
Expanding: $$= \langle \phi(\mathbf{x}_i), \phi(\mathbf{x}_j) \rangle - \frac{1}{n}\sum_m\langle \phi(\mathbf{x}_m), \phi(\mathbf{x}_j) \rangle - \frac{1}{n}\sum_\ell\langle \phi(\mathbf{x}_i), \phi(\mathbf{x}_\ell) \rangle + \frac{1}{n^2}\sum_m\sum_\ell\langle \phi(\mathbf{x}_m), \phi(\mathbf{x}_\ell) \rangle$$
$$= K_{ij} - \frac{1}{n}\sum_m K_{mj} - \frac{1}{n}\sum_\ell K_{i\ell} + \frac{1}{n^2}\sum_m\sum_\ell K_{m\ell}$$
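As a sanity check on this formula (my own illustration, not part of the original derivation), one can pick a kernel with a known explicit feature map, e.g. the degree-2 polynomial kernel $k(\mathbf{x},\mathbf{y}) = (\mathbf{x}^T\mathbf{y})^2$ with $\phi(x_1, x_2) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$, and confirm that centering the kernel matrix reproduces $\langle\tilde{\phi}(\mathbf{x}_i),\tilde{\phi}(\mathbf{x}_j)\rangle$ computed directly:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))

# Explicit feature map of the degree-2 polynomial kernel k(x, y) = (x^T y)^2
Phi = np.column_stack([X[:, 0]**2, np.sqrt(2) * X[:, 0] * X[:, 1], X[:, 1]**2])
Phi_centered = Phi - Phi.mean(axis=0)
K_tilde_explicit = Phi_centered @ Phi_centered.T      # <phi~_i, phi~_j> directly

# Element-wise centering formula applied to the uncentered kernel matrix
K = (X @ X.T) ** 2
K_tilde_formula = (K
                   - K.mean(axis=0, keepdims=True)    # column means
                   - K.mean(axis=1, keepdims=True)    # row means
                   + K.mean())                        # grand mean

print(np.allclose(K_tilde_explicit, K_tilde_formula))  # True
```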
Let $\mathbf{1}_n$ denote the $n \times n$ matrix of all ones. Define the centering matrix $\mathbf{H} = \mathbf{I}_n - \frac{1}{n}\mathbf{1}_n$. Then the centered kernel matrix is simply $\tilde{\mathbf{K}} = \mathbf{H}\mathbf{K}\mathbf{H}$. This is called double centering because we apply centering from both left and right.
The Double Centering Formula
In matrix form: $$\tilde{\mathbf{K}} = \mathbf{H}\mathbf{K}\mathbf{H}$$
where $\mathbf{H} = \mathbf{I}_n - \frac{1}{n}\mathbf{1}_n$ is the centering matrix.
Properties of the Centering Matrix
The centering matrix $\mathbf{H} = \mathbf{I}_n - \frac{1}{n}\mathbf{1}_n$ is symmetric ($\mathbf{H}^T = \mathbf{H}$) and idempotent ($\mathbf{H}^2 = \mathbf{H}$): it is the orthogonal projection onto the subspace of vectors whose entries sum to zero. Multiplying $\mathbf{K}$ by $\mathbf{H}$ on the left centers its columns; multiplying on the right centers its rows.
Computational Procedure
In practice we do not form $\mathbf{H}$ explicitly: subtract each row mean and each column mean of $\mathbf{K}$ and add back the grand mean, exactly as in the element-wise formula above. This requires $O(n^2)$ operations, no more than computing the kernel matrix itself.
```python
import numpy as np


def center_kernel_matrix(K: np.ndarray) -> np.ndarray:
    """
    Center the kernel matrix in feature space using double centering.

    Implements: K_tilde = H @ K @ H where H = I - (1/n) * ones

    Parameters:
        K: Uncentered kernel matrix of shape (n, n)

    Returns:
        K_tilde: Centered kernel matrix of shape (n, n)
    """
    # Method 1: Explicit formula (numerically stable)
    row_means = K.mean(axis=1, keepdims=True)   # Shape: (n, 1)
    col_means = K.mean(axis=0, keepdims=True)   # Shape: (1, n)
    grand_mean = K.mean()

    K_centered = K - row_means - col_means + grand_mean
    return K_centered


def center_kernel_matrix_alternative(K: np.ndarray) -> np.ndarray:
    """
    Alternative formulation using the centering matrix H explicitly.
    Mathematically equivalent but slightly less efficient.
    """
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H


# Verify equivalence
np.random.seed(42)
n = 100
X = np.random.randn(n, 5)
K = X @ X.T  # Linear kernel for demonstration

K_centered_1 = center_kernel_matrix(K)
K_centered_2 = center_kernel_matrix_alternative(K)

print(f"Max difference between methods: {np.abs(K_centered_1 - K_centered_2).max():.2e}")
# Output: Max difference between methods: 2.84e-14 (numerical precision)
```

When we solve the dual eigenvalue problem $\tilde{\mathbf{K}}\boldsymbol{\alpha} = \mu\boldsymbol{\alpha}$, we obtain eigenvectors $\boldsymbol{\alpha}$ that are normalized in the coefficient space (i.e., $|\boldsymbol{\alpha}| = 1$). However, this does not mean the corresponding principal component $\mathbf{v} = \sum_i \alpha_i \tilde{\phi}(\mathbf{x}_i)$ has unit norm in feature space.
For proper PCA, we need $|\mathbf{v}| = 1$ in $\mathcal{F}$.
Computing the Norm in Feature Space
$$|\mathbf{v}|^2 = \langle \mathbf{v}, \mathbf{v} \rangle = \left\langle \sum_i \alpha_i \tilde{\phi}(\mathbf{x}_i), \sum_j \alpha_j \tilde{\phi}(\mathbf{x}_j) \right\rangle$$
$$= \sum_i \sum_j \alpha_i \alpha_j \langle \tilde{\phi}(\mathbf{x}_i), \tilde{\phi}(\mathbf{x}_j) \rangle = \sum_i \sum_j \alpha_i \alpha_j \tilde{K}_{ij}$$
$$= \boldsymbol{\alpha}^T \tilde{\mathbf{K}} \boldsymbol{\alpha}$$
Since $\boldsymbol{\alpha}$ is an eigenvector with eigenvalue $\mu$: $$\boldsymbol{\alpha}^T \tilde{\mathbf{K}} \boldsymbol{\alpha} = \boldsymbol{\alpha}^T (\mu \boldsymbol{\alpha}) = \mu |\boldsymbol{\alpha}|^2$$
If $\boldsymbol{\alpha}$ is normalized to $|\boldsymbol{\alpha}| = 1$, then $|\mathbf{v}|^2 = \mu$.
To ensure $|\mathbf{v}| = 1$, we must normalize the eigenvector $\boldsymbol{\alpha}$ such that $\boldsymbol{\alpha}^T \tilde{\mathbf{K}} \boldsymbol{\alpha} = 1$, which means $\mu |\boldsymbol{\alpha}|^2 = 1$, or equivalently $|\boldsymbol{\alpha}| = 1/\sqrt{\mu}$.
Normalization Procedure
For each eigenvector $\boldsymbol{\alpha}^{(k)}$ with eigenvalue $\mu_k > 0$, rescale it so that $|\boldsymbol{\alpha}^{(k)}| = 1/\sqrt{\mu_k}$; equivalently, divide the unit-norm eigenvector returned by the eigensolver by $\sqrt{\mu_k}$.
After this normalization, projecting any point onto the principal component gives: $$z_k(\mathbf{x}) = \langle \mathbf{v}_k, \tilde{\phi}(\mathbf{x}) \rangle = \sum_i \alpha_i^{(k)} \langle \tilde{\phi}(\mathbf{x}_i), \tilde{\phi}(\mathbf{x}) \rangle$$
Why Zero Eigenvalues Must Be Discarded
If $\mu_k = 0$, then $\tilde{\mathbf{K}}\boldsymbol{\alpha}^{(k)} = \mathbf{0}$, which means: $$|\mathbf{v}_k|^2 = \boldsymbol{\alpha}^{(k)T}\tilde{\mathbf{K}}\boldsymbol{\alpha}^{(k)} = 0$$
The corresponding principal component has zero length in feature space—it's a trivial direction. These should be excluded from the final reduced representation.
```python
import numpy as np


def normalize_kpca_eigenvectors(eigenvalues: np.ndarray,
                                eigenvectors: np.ndarray,
                                tol: float = 1e-10) -> tuple:
    """
    Normalize KPCA eigenvectors for unit-length principal components in feature space.

    Parameters:
        eigenvalues: Array of eigenvalues from kernel matrix
        eigenvectors: Matrix where columns are eigenvectors (n x n)
        tol: Threshold below which eigenvalues are considered zero

    Returns:
        eigenvalues_valid: Non-zero eigenvalues
        alphas_normalized: Normalized eigenvectors (columns)
    """
    # Sort by eigenvalue (descending)
    idx = np.argsort(eigenvalues)[::-1]
    eigenvalues = eigenvalues[idx]
    eigenvectors = eigenvectors[:, idx]

    # Keep only positive eigenvalues
    valid = eigenvalues > tol
    eigenvalues_valid = eigenvalues[valid]
    eigenvectors_valid = eigenvectors[:, valid]

    # Normalize: divide each eigenvector by sqrt(eigenvalue)
    # This ensures ||v_k|| = 1 in feature space
    alphas_normalized = eigenvectors_valid / np.sqrt(eigenvalues_valid)

    return eigenvalues_valid, alphas_normalized


# Example usage
np.random.seed(42)
n = 50
K_centered = np.random.randn(n, n)
K_centered = K_centered @ K_centered.T  # Make symmetric PSD

eigenvalues, eigenvectors = np.linalg.eigh(K_centered)
eigenvalues_valid, alphas = normalize_kpca_eigenvectors(eigenvalues, eigenvectors)

print(f"Number of non-trivial components: {len(eigenvalues_valid)}")

# Verify normalization: alpha^T @ K @ alpha should equal 1
for k in range(min(3, len(eigenvalues_valid))):
    alpha_k = alphas[:, k]
    norm_squared = alpha_k @ K_centered @ alpha_k
    print(f"Component {k+1}: ||v||^2 = {norm_squared:.6f} (should be ≈ 1.0)")
```

A critical capability for any dimensionality reduction method is projecting new, unseen data points onto the learned reduced representation. For Kernel PCA, this requires careful handling of centering.
The Projection Formula
For a new point $\mathbf{x}^*$ (not in the training set), the projection onto the $k$-th principal component is: $$z_k(\mathbf{x}^*) = \langle \mathbf{v}_k, \tilde{\phi}(\mathbf{x}^*) \rangle$$
Expanding: $$z_k(\mathbf{x}^*) = \sum_{i=1}^{n} \alpha_i^{(k)} \langle \tilde{\phi}(\mathbf{x}_i), \tilde{\phi}(\mathbf{x}^*) \rangle$$
The inner product involves centered feature vectors, but we can only compute uncentered kernel values $k(\mathbf{x}_i, \mathbf{x}^*)$.
Centering for Test Points
We need to compute: $$\langle \tilde{\phi}(\mathbf{x}_i), \tilde{\phi}(\mathbf{x}^*) \rangle = \langle \phi(\mathbf{x}_i) - \bar{\phi}, \phi(\mathbf{x}^*) - \bar{\phi} \rangle$$
where $\bar{\phi} = \frac{1}{n}\sum_m \phi(\mathbf{x}_m)$ is the mean from the training set.
Expanding: $$= k(\mathbf{x}_i, \mathbf{x}^*) - \frac{1}{n}\sum_m k(\mathbf{x}_m, \mathbf{x}^*) - \frac{1}{n}\sum_m k(\mathbf{x}_i, \mathbf{x}_m) + \frac{1}{n^2}\sum_m \sum_\ell k(\mathbf{x}_m, \mathbf{x}_\ell)$$
Let $\mathbf{k}^* = [k(\mathbf{x}_1, \mathbf{x}^*), \ldots, k(\mathbf{x}_n, \mathbf{x}^*)]^T$ be the kernel vector for the test point, and let $\mathbf{1}$ denote the $n$-dimensional vector of ones. In matrix-vector form, the centered kernel vector is: $$\tilde{\mathbf{k}}^* = \mathbf{k}^* - \frac{1}{n}\mathbf{K}\mathbf{1} - \frac{1}{n}(\mathbf{1}^T \mathbf{k}^*)\mathbf{1} + \frac{1}{n^2}(\mathbf{1}^T\mathbf{K}\mathbf{1})\mathbf{1}$$
Simplified Formulation
Define: $\bar{\mathbf{k}} = \frac{1}{n}\mathbf{K}\mathbf{1}$ (the vector of row means of the training kernel matrix), $\bar{k}^* = \frac{1}{n}\mathbf{1}^T\mathbf{k}^*$ (the mean of the test kernel vector), and $\bar{K} = \frac{1}{n^2}\mathbf{1}^T\mathbf{K}\mathbf{1}$ (the grand mean of the training kernel matrix).
The centered kernel vector is: $$\tilde{\mathbf{k}}^* = \mathbf{k}^* - \bar{\mathbf{k}} - \bar{k}^* \mathbf{1} + \bar{K} \mathbf{1}$$
The projection is then: $$z_k(\mathbf{x}^*) = \sum_i \alpha_i^{(k)} \tilde{k}^*_i = (\boldsymbol{\alpha}^{(k)})^T \tilde{\mathbf{k}}^*$$
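As a quick consistency check (not stated in the original derivation, but immediate from the formulas above): if the "new" point happens to be a training point $\mathbf{x}_i$, then $\tilde{\mathbf{k}}^*$ is exactly the $i$-th column of $\tilde{\mathbf{K}}$, so $$z_k(\mathbf{x}_i) = (\tilde{\mathbf{K}}\boldsymbol{\alpha}^{(k)})_i = \mu_k\,\alpha_i^{(k)},$$ which coincides with the training-time projection. The test-time formula therefore extends the fitted model rather than redefining it.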
Complete Projection Procedure
To project a new point: (1) compute its kernel vector $\mathbf{k}^*$ against all $n$ training points; (2) center it using the stored training statistics to obtain $\tilde{\mathbf{k}}^*$; (3) compute $z_k(\mathbf{x}^*) = (\boldsymbol{\alpha}^{(k)})^T \tilde{\mathbf{k}}^*$ for each retained component $k$.
Note: Training statistics ($\bar{\mathbf{k}}$, $\bar{K}$) must be stored during training for use at test time.
```python
import numpy as np
from typing import Callable


class KernelPCA:
    """
    Complete Kernel PCA implementation with proper centering for test points.
    """

    def __init__(self, n_components: int, kernel: Callable):
        """
        Parameters:
            n_components: Number of principal components to extract
            kernel: Kernel function k(X, Y) returning matrix of kernel values
        """
        self.n_components = n_components
        self.kernel = kernel

        # Stored after fitting
        self.X_train = None
        self.alphas = None       # Normalized eigenvectors
        self.eigenvalues = None  # Corresponding eigenvalues
        self.K_train = None      # Training kernel matrix
        self.row_means = None    # For centering test points
        self.grand_mean = None   # For centering test points

    def fit(self, X: np.ndarray) -> 'KernelPCA':
        """Fit Kernel PCA to training data."""
        self.X_train = X

        # Compute training kernel matrix
        self.K_train = self.kernel(X, X)

        # Store centering statistics
        self.row_means = self.K_train.mean(axis=0)  # Column means (symmetric)
        self.grand_mean = self.K_train.mean()

        # Center the kernel matrix
        K_centered = (self.K_train
                      - self.row_means.reshape(-1, 1)
                      - self.row_means.reshape(1, -1)
                      + self.grand_mean)

        # Eigendecomposition
        eigenvalues, eigenvectors = np.linalg.eigh(K_centered)

        # Sort descending
        idx = np.argsort(eigenvalues)[::-1]
        eigenvalues = eigenvalues[idx]
        eigenvectors = eigenvectors[:, idx]

        # Keep top k positive eigenvalues
        valid = eigenvalues > 1e-10
        eigenvalues = eigenvalues[valid][:self.n_components]
        eigenvectors = eigenvectors[:, valid][:, :self.n_components]

        # Normalize for unit-length in feature space
        self.alphas = eigenvectors / np.sqrt(eigenvalues)
        self.eigenvalues = eigenvalues

        return self

    def transform(self, X: np.ndarray) -> np.ndarray:
        """Project new points onto kernel principal components."""
        # Kernel values between test and training points
        K_test = self.kernel(X, self.X_train)  # (n_test, n_train)

        # Center using training statistics
        test_means = K_test.mean(axis=1, keepdims=True)   # Mean over training points for each test point
        K_test_centered = (K_test
                           - self.row_means.reshape(1, -1)  # Training column means
                           - test_means                     # Test row means
                           + self.grand_mean)               # Training grand mean

        # Project onto principal components
        Z = K_test_centered @ self.alphas
        return Z

    def fit_transform(self, X: np.ndarray) -> np.ndarray:
        """Fit and transform in one step."""
        self.fit(X)
        return self.transform(X)


# Example: RBF kernel
def rbf_kernel(X, Y, gamma=0.1):
    """Compute RBF kernel matrix between X and Y."""
    X_sq = np.sum(X**2, axis=1, keepdims=True)
    Y_sq = np.sum(Y**2, axis=1, keepdims=True)
    dist_sq = X_sq + Y_sq.T - 2 * X @ Y.T
    return np.exp(-gamma * dist_sq)


# Demo with concentric circles
from sklearn.datasets import make_circles

X, y = make_circles(n_samples=200, noise=0.05, factor=0.3, random_state=42)

kpca = KernelPCA(n_components=2, kernel=lambda X, Y: rbf_kernel(X, Y, gamma=5))
Z = kpca.fit_transform(X)

print(f"Original shape: {X.shape}")
print(f"Transformed shape: {Z.shape}")
print(f"Eigenvalues: {kpca.eigenvalues[:5]}")
```

We now have all the pieces for a complete Kernel PCA implementation. To consolidate: (1) compute the kernel matrix $\mathbf{K}$; (2) center it with the double centering formula to obtain $\tilde{\mathbf{K}}$; (3) solve $\tilde{\mathbf{K}}\boldsymbol{\alpha} = \mu\boldsymbol{\alpha}$; (4) discard zero eigenvalues and rescale each $\boldsymbol{\alpha}^{(k)}$ by $1/\sqrt{\mu_k}$; (5) project training or test points using centered kernel vectors. The table below summarizes the cost of each step.
| Step | Time Complexity | Space Complexity | Notes |
|---|---|---|---|
| Kernel matrix computation | $O(n^2 d)$ | $O(n^2)$ | Depends on kernel; $d$ is input dimension |
| Centering | $O(n^2)$ | $O(n^2)$ | In-place possible |
| Full eigendecomposition | $O(n^3)$ | $O(n^2)$ | Standard algorithm |
| Partial eigendecomposition | $O(n^2 k)$ | $O(nk)$ | Using iterative methods for top $k$ |
| Projection (per test point) | $O(nd + nk)$ | $O(n)$ | $n$ kernel evals + $k$ dot products |
Kernel PCA's $O(n^3)$ training time and $O(n^2)$ memory make it challenging for large datasets. For $n > 10,000$, consider approximations: Nyström method (random subset of kernel matrix), Random Fourier Features (approximate RBF kernel), or Incremental KPCA.
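As one illustration of the approximation route (a minimal sketch assuming an RBF kernel $k(\mathbf{x},\mathbf{y}) = \exp(-\gamma\|\mathbf{x}-\mathbf{y}\|^2)$; the function name `rff_features` and the parameter choices are mine, not from the text), Random Fourier Features replace the implicit feature map with an explicit low-dimensional random one, after which ordinary linear PCA approximates Kernel PCA:

```python
import numpy as np

def rff_features(X: np.ndarray, n_features: int = 500, gamma: float = 0.1,
                 seed: int = 0) -> np.ndarray:
    """Random Fourier Features approximating the RBF kernel exp(-gamma * ||x - y||^2)."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    # Frequencies sampled from the kernel's spectral density N(0, 2*gamma*I)
    W = rng.normal(scale=np.sqrt(2 * gamma), size=(d, n_features))
    b = rng.uniform(0.0, 2 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

# z(x)^T z(y) ~= k(x, y), so standard PCA on these features approximates RBF Kernel PCA
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 3))
Z = rff_features(X, n_features=500, gamma=0.5)
Zc = Z - Z.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(Zc.T @ Zc / len(Zc))   # 500 x 500 instead of n x n
order = np.argsort(eigvals)[::-1]
X_kpca_approx = Zc @ eigvecs[:, order[:2]]               # approximate top-2 KPCA projection
print(X_kpca_approx.shape)
```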
We have derived Kernel PCA from first principles, showing exactly how the kernel trick enables PCA in potentially infinite-dimensional feature spaces.
You now have a complete mathematical understanding of the kernel trick for PCA.
What's Next:
The next page dives deeper into centering in feature space—a step that looks simple but contains important subtleties. We'll visualize what centering does geometrically, explore what happens when centering is omitted or done incorrectly, and develop diagnostic tools to verify correct implementation.