In the previous page, we established the conceptual motivation for Kernel PCA: standard PCA fails on nonlinearly structured data, and kernel methods offer a path to performing PCA in a high-dimensional feature space without explicitly computing feature vectors. Now we make this rigorous.
The kernel trick is not merely a computational convenience—it is a profound mathematical insight that transforms intractable problems into elegant, solvable ones. By reformulating PCA to depend only on inner products between data points, we unlock the ability to work in infinite-dimensional spaces while performing only finite-dimensional linear algebra.
This page derives the kernel trick for PCA from first principles. We begin with standard PCA, reformulate it in terms of inner products, introduce the kernel function, and arrive at the complete Kernel PCA algorithm. Along the way, we'll see why certain mathematical structures (the dual representation, the Gram matrix, reproducing kernel Hilbert spaces) are not mere technicalities but essential to the algorithm's correctness and efficiency.
By the end of this page, you will be able to derive Kernel PCA from first principles, understand why the dual formulation is mathematically necessary, prove that eigenvectors of the kernel matrix correspond to principal components in feature space, and implement the complete algorithm including proper normalization.
Before kernelizing PCA, let's establish the standard formulation precisely. Consider a dataset of $n$ samples $\{\mathbf{x}_1, \ldots, \mathbf{x}_n\}$ where each $\mathbf{x}_i \in \mathbb{R}^d$.
Step 1: Center the Data
Compute the mean: $\bar{\mathbf{x}} = \frac{1}{n}\sum_{i=1}^{n}\mathbf{x}_i$
Center each point: $\tilde{\mathbf{x}}_i = \mathbf{x}_i - \bar{\mathbf{x}}$
Centering is essential—PCA finds directions of maximum variance from the mean, not from the origin.
Step 2: Compute the Covariance Matrix
The sample covariance matrix is: $$\mathbf{C} = \frac{1}{n}\sum_{i=1}^{n}\tilde{\mathbf{x}}_i\tilde{\mathbf{x}}_i^T = \frac{1}{n}\tilde{\mathbf{X}}^T\tilde{\mathbf{X}}$$
where $\tilde{\mathbf{X}} \in \mathbb{R}^{n \times d}$ is the centered data matrix with rows $\tilde{\mathbf{x}}_i^T$.
The covariance matrix $\mathbf{C} \in \mathbb{R}^{d \times d}$ is symmetric and positive semi-definite.
Step 3: Eigendecomposition
Solve the eigenvalue problem: $$\mathbf{C}\mathbf{v} = \lambda\mathbf{v}$$
This yields $d$ eigenvalues $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_d \geq 0$ and corresponding orthonormal eigenvectors $\mathbf{v}_1, \ldots, \mathbf{v}_d$.
Step 4: Project Data
Project onto the top $k$ principal components: $$z_j^{(i)} = \tilde{\mathbf{x}}_i^T\mathbf{v}_j = \langle \tilde{\mathbf{x}}_i, \mathbf{v}_j \rangle$$
for $j = 1, \ldots, k$. The reduced representation is $\mathbf{z}_i = (z_1^{(i)}, \ldots, z_k^{(i)})^T \in \mathbb{R}^k$.
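Before kernelizing these four steps, it may help to see them as code. The following is a minimal NumPy sketch (the function name `standard_pca` and the use of `np.linalg.eigh` are illustrative choices, not prescribed by the text):

```python
import numpy as np

def standard_pca(X: np.ndarray, k: int):
    """Minimal standard PCA: center, covariance, eigendecomposition, projection."""
    n = X.shape[0]
    X_centered = X - X.mean(axis=0)          # Step 1: center the data
    C = (X_centered.T @ X_centered) / n      # Step 2: covariance matrix (d x d)
    eigvals, eigvecs = np.linalg.eigh(C)     # Step 3: eigendecomposition (ascending order)
    order = np.argsort(eigvals)[::-1]        # sort eigenvalues descending
    V = eigvecs[:, order[:k]]                # top-k orthonormal eigenvectors
    Z = X_centered @ V                       # Step 4: project onto principal components
    return Z, eigvals[order[:k]], V

# Example
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
Z, lambdas, V = standard_pca(X, k=2)
print(Z.shape, lambdas)
```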
PCA can be derived from two equivalent perspectives: (1) Maximum Variance: Find directions maximizing projected variance, (2) Minimum Reconstruction Error: Find directions minimizing squared reconstruction error. Both lead to the same eigenvector solution. For kernelization, the key observation is that both formulations can be expressed using only inner products $\langle \mathbf{x}_i, \mathbf{x}_j \rangle$.
Now consider a nonlinear mapping $\phi: \mathbb{R}^d \rightarrow \mathcal{F}$ where $\mathcal{F}$ is a (potentially high- or infinite-dimensional) feature space. We want to perform PCA on the mapped data $\{\phi(\mathbf{x}_1), \ldots, \phi(\mathbf{x}_n)\}$.
Step 1: Center in Feature Space
Mean in feature space: $\bar{\phi} = \frac{1}{n}\sum_{i=1}^{n}\phi(\mathbf{x}_i)$
Centered data: $\tilde{\phi}(\mathbf{x}_i) = \phi(\mathbf{x}_i) - \bar{\phi}$
Step 2: Covariance in Feature Space
The covariance matrix in feature space is: $$\mathbf{C}_{\mathcal{F}} = \frac{1}{n}\sum_{i=1}^{n}\tilde{\phi}(\mathbf{x}_i)\tilde{\phi}(\mathbf{x}_i)^T$$
This is an operator on $\mathcal{F}$. If $\mathcal{F}$ has dimension $D$ (possibly infinite), $\mathbf{C}_{\mathcal{F}}$ is a $D \times D$ matrix.
Step 3: Eigenvalue Problem in Feature Space
We seek eigenvectors $\mathbf{v} \in \mathcal{F}$ satisfying: $$\mathbf{C}_{\mathcal{F}}\mathbf{v} = \lambda\mathbf{v}$$
If $\mathcal{F}$ is infinite-dimensional (as with the RBF kernel), we cannot explicitly form $\mathbf{C}_{\mathcal{F}}$ or store its eigenvectors. Even for finite but high-dimensional feature spaces, explicit computation becomes intractable. We need a fundamentally different approach that never requires explicit coordinates in $\mathcal{F}$.
The Key Observation: Eigenvector Representation
Here is the crucial insight that enables the kernel trick. Consider the eigenvalue equation: $$\mathbf{C}_{\mathcal{F}}\mathbf{v} = \lambda\mathbf{v}$$
Substituting the definition of $\mathbf{C}_{\mathcal{F}}$: $$\frac{1}{n}\sum_{i=1}^{n}\tilde{\phi}(\mathbf{x}_i)\tilde{\phi}(\mathbf{x}_i)^T\mathbf{v} = \lambda\mathbf{v}$$
$$\frac{1}{n}\sum_{i=1}^{n}\tilde{\phi}(\mathbf{x}_i)\langle \tilde{\phi}(\mathbf{x}_i), \mathbf{v} \rangle = \lambda\mathbf{v}$$
The term $\langle \tilde{\phi}(\mathbf{x}_i), \mathbf{v} \rangle$ is a scalar. So the eigenvector $\mathbf{v}$ is a linear combination of $\{\tilde{\phi}(\mathbf{x}_i)\}$: $$\mathbf{v} = \sum_{i=1}^{n}\alpha_i \tilde{\phi}(\mathbf{x}_i)$$
for some coefficients $\alpha_1, \ldots, \alpha_n$.
Why This Matters
This representation theorem tells us that even though $\mathbf{v}$ lives in a potentially infinite-dimensional space, it can be fully characterized by $n$ real numbers $(\alpha_1, \ldots, \alpha_n)$. We've reduced an infinite-dimensional problem to a finite-dimensional one!
Now we derive the eigenvalue problem in terms of the coefficient vector $\boldsymbol{\alpha} = (\alpha_1, \ldots, \alpha_n)^T$.
Substituting the Representation
We have the eigenvalue equation $\mathbf{C}_{\mathcal{F}}\mathbf{v} = \lambda\mathbf{v}$ together with the representation $\mathbf{v} = \sum_{j=1}^{n}\alpha_j \tilde{\phi}(\mathbf{x}_j)$.
Take the inner product of both sides of the eigenvalue equation with $\tilde{\phi}(\mathbf{x}_\ell)$ for each $\ell \in \{1, \ldots, n\}$:
$$\langle \tilde{\phi}(\mathbf{x}_\ell), \mathbf{C}_{\mathcal{F}}\mathbf{v} \rangle = \lambda \langle \tilde{\phi}(\mathbf{x}_\ell), \mathbf{v} \rangle$$
Left Side Expansion
$$\langle \tilde{\phi}(\mathbf{x}_\ell), \mathbf{C}_{\mathcal{F}}\mathbf{v} \rangle = \left\langle \tilde{\phi}(\mathbf{x}_\ell), \frac{1}{n}\sum_{i=1}^{n}\tilde{\phi}(\mathbf{x}_i)\langle \tilde{\phi}(\mathbf{x}_i), \mathbf{v} \rangle \right\rangle$$
$$= \frac{1}{n}\sum_{i=1}^{n}\langle \tilde{\phi}(\mathbf{x}_\ell), \tilde{\phi}(\mathbf{x}_i) \rangle \langle \tilde{\phi}(\mathbf{x}_i), \mathbf{v} \rangle$$
Now substitute $\mathbf{v} = \sum_{j}\alpha_j \tilde{\phi}(\mathbf{x}_j)$:
$$= \frac{1}{n}\sum_{i=1}^{n}\langle \tilde{\phi}(\mathbf{x}_\ell), \tilde{\phi}(\mathbf{x}_i) \rangle \sum_{j=1}^{n}\alpha_j\langle \tilde{\phi}(\mathbf{x}_i), \tilde{\phi}(\mathbf{x}_j) \rangle$$
Notice that every term involves inner products of the form $\langle \tilde{\phi}(\mathbf{x}_i), \tilde{\phi}(\mathbf{x}_j) \rangle$. If we define the centered kernel matrix $\tilde{K}_{ij} = \langle \tilde{\phi}(\mathbf{x}_i), \tilde{\phi}(\mathbf{x}_j) \rangle$, all computations reduce to matrix operations on $\tilde{\mathbf{K}}$.
Defining the Centered Kernel Matrix
Let $\tilde{K}_{ij} = \langle \tilde{\phi}(\mathbf{x}_i), \tilde{\phi}(\mathbf{x}_j) \rangle$. The left side becomes: $$\frac{1}{n}\sum_{i=1}^{n}\tilde{K}_{\ell i}\sum_{j=1}^{n}\alpha_j\tilde{K}_{ij} = \frac{1}{n}\sum_{i=1}^{n}\tilde{K}_{\ell i}(\tilde{\mathbf{K}}\boldsymbol{\alpha})_i = \frac{1}{n}(\tilde{\mathbf{K}}^2\boldsymbol{\alpha})_\ell$$
Right Side Expansion
$$\lambda \langle \tilde{\phi}(\mathbf{x}_\ell), \mathbf{v} \rangle = \lambda \sum_{j=1}^{n}\alpha_j\langle \tilde{\phi}(\mathbf{x}_\ell), \tilde{\phi}(\mathbf{x}_j) \rangle = \lambda \sum_{j=1}^{n}\alpha_j\tilde{K}_{\ell j} = \lambda(\tilde{\mathbf{K}}\boldsymbol{\alpha})_\ell$$
The Matrix Equation
Since the above holds for all $\ell = 1, \ldots, n$, we have: $$\frac{1}{n}\tilde{\mathbf{K}}^2\boldsymbol{\alpha} = \lambda\tilde{\mathbf{K}}\boldsymbol{\alpha}$$
Multiplying both sides by $\tilde{\mathbf{K}}^{-1}$ (or more carefully, noting this holds for eigenvectors with nonzero eigenvalue): $$\tilde{\mathbf{K}}\boldsymbol{\alpha} = n\lambda\boldsymbol{\alpha}$$
Defining $\mu = n\lambda$, we have the dual eigenvalue problem: $$\tilde{\mathbf{K}}\boldsymbol{\alpha} = \mu\boldsymbol{\alpha}$$
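To make the dual problem concrete, here is a small numerical check (my own illustration, assuming a linear kernel so that both sides can be computed explicitly): the eigenvalues of the centered Gram matrix are $n$ times the covariance eigenvalues, and $\mathbf{v} = \sum_i \alpha_i \tilde{\phi}(\mathbf{x}_i)$ is parallel to the corresponding primal eigenvector.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
Xc = X - X.mean(axis=0)          # centered data; phi is the identity for a linear kernel
n = Xc.shape[0]

# Primal: eigenpairs of the covariance matrix C = (1/n) Xc^T Xc
lam, V = np.linalg.eigh(Xc.T @ Xc / n)
lam, V = lam[::-1], V[:, ::-1]   # sort descending

# Dual: eigenpairs of the centered Gram (kernel) matrix K_tilde = Xc Xc^T
mu, A = np.linalg.eigh(Xc @ Xc.T)
mu, A = mu[::-1], A[:, ::-1]

print(np.allclose(mu[:4], n * lam))          # mu = n * lambda on the nonzero spectrum

# v = sum_i alpha_i * x_i points along the primal eigenvector (up to sign)
v1 = Xc.T @ A[:, 0]
v1 /= np.linalg.norm(v1)
print(np.allclose(np.abs(v1), np.abs(V[:, 0])))
```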
The derivation above used the centered kernel matrix $\tilde{K}_{ij} = \langle \tilde{\phi}(\mathbf{x}_i), \tilde{\phi}(\mathbf{x}_j) \rangle$ where $\tilde{\phi}(\mathbf{x}) = \phi(\mathbf{x}) - \bar{\phi}$. But we only have access to the uncentered kernel: $$K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j) = \langle \phi(\mathbf{x}_i), \phi(\mathbf{x}_j) \rangle$$
How do we compute the centered kernel matrix without explicit feature vectors?
Deriving the Centering Formula
Recall: $$\tilde{\phi}(\mathbf{x}_i) = \phi(\mathbf{x}_i) - \frac{1}{n}\sum_{m=1}^{n}\phi(\mathbf{x}_m)$$
Therefore: $$\tilde{K}_{ij} = \langle \tilde{\phi}(\mathbf{x}_i), \tilde{\phi}(\mathbf{x}_j) \rangle$$
$$= \left\langle \phi(\mathbf{x}_i) - \frac{1}{n}\sum_m\phi(\mathbf{x}_m),\ \phi(\mathbf{x}_j) - \frac{1}{n}\sum_\ell\phi(\mathbf{x}_\ell) \right\rangle$$
Expanding: $$= \langle \phi(\mathbf{x}_i), \phi(\mathbf{x}_j) \rangle - \frac{1}{n}\sum_m\langle \phi(\mathbf{x}_m), \phi(\mathbf{x}_j) \rangle - \frac{1}{n}\sum_\ell\langle \phi(\mathbf{x}_i), \phi(\mathbf{x}_\ell) \rangle + \frac{1}{n^2}\sum_m\sum_\ell\langle \phi(\mathbf{x}_m), \phi(\mathbf{x}_\ell) \rangle$$
$$= K_{ij} - \frac{1}{n}\sum_m K_{mj} - \frac{1}{n}\sum_\ell K_{i\ell} + \frac{1}{n^2}\sum_m\sum_\ell K_{m\ell}$$
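As a sanity check on this formula (my own illustration, not part of the original derivation), one can pick a kernel with a known explicit feature map, e.g. the degree-2 polynomial kernel $k(\mathbf{x},\mathbf{y}) = (\mathbf{x}^T\mathbf{y})^2$ with $\phi(x_1, x_2) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$, and confirm that centering the kernel matrix reproduces $\langle\tilde{\phi}(\mathbf{x}_i),\tilde{\phi}(\mathbf{x}_j)\rangle$ computed directly:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))

# Explicit feature map of the degree-2 polynomial kernel k(x, y) = (x^T y)^2
Phi = np.column_stack([X[:, 0]**2, np.sqrt(2) * X[:, 0] * X[:, 1], X[:, 1]**2])
Phi_centered = Phi - Phi.mean(axis=0)
K_tilde_explicit = Phi_centered @ Phi_centered.T      # <phi~_i, phi~_j> directly

# Element-wise centering formula applied to the uncentered kernel matrix
K = (X @ X.T) ** 2
K_tilde_formula = (K
                   - K.mean(axis=0, keepdims=True)    # column means
                   - K.mean(axis=1, keepdims=True)    # row means
                   + K.mean())                        # grand mean

print(np.allclose(K_tilde_explicit, K_tilde_formula))  # True
```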
Let $\mathbf{1}_n$ denote the $n \times n$ matrix of all ones. Define the centering matrix $\mathbf{H} = \mathbf{I}_n - \frac{1}{n}\mathbf{1}_n$. Then the centered kernel matrix is simply $\tilde{\mathbf{K}} = \mathbf{H}\mathbf{K}\mathbf{H}$. This is called double centering because we apply centering from both left and right.
The Double Centering Formula
In matrix form: $$\tilde{\mathbf{K}} = \mathbf{H}\mathbf{K}\mathbf{H}$$
where $\mathbf{H} = \mathbf{I}_n - \frac{1}{n}\mathbf{1}_n$ is the centering matrix.
Properties of the Centering Matrix
The centering matrix $\mathbf{H} = \mathbf{I}_n - \frac{1}{n}\mathbf{1}_n$ is symmetric ($\mathbf{H}^T = \mathbf{H}$) and idempotent ($\mathbf{H}^2 = \mathbf{H}$): it is the orthogonal projection onto the subspace of vectors whose entries sum to zero. Multiplying $\mathbf{K}$ by $\mathbf{H}$ on the left centers its columns; multiplying on the right centers its rows.
Computational Procedure
In practice we do not form $\mathbf{H}$ explicitly: subtract each row mean and each column mean of $\mathbf{K}$ and add back the grand mean, exactly as in the element-wise formula above. This requires $O(n^2)$ operations, no more than computing the kernel matrix itself.
```python
import numpy as np


def center_kernel_matrix(K: np.ndarray) -> np.ndarray:
    """
    Center the kernel matrix in feature space using double centering.

    Implements: K_tilde = H @ K @ H where H = I - (1/n) * ones

    Parameters:
        K: Uncentered kernel matrix of shape (n, n)

    Returns:
        K_tilde: Centered kernel matrix of shape (n, n)
    """
    # Method 1: Explicit formula (numerically stable)
    row_means = K.mean(axis=1, keepdims=True)   # Shape: (n, 1)
    col_means = K.mean(axis=0, keepdims=True)   # Shape: (1, n)
    grand_mean = K.mean()

    K_centered = K - row_means - col_means + grand_mean
    return K_centered


def center_kernel_matrix_alternative(K: np.ndarray) -> np.ndarray:
    """
    Alternative formulation using the centering matrix H explicitly.
    Mathematically equivalent but slightly less efficient.
    """
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H


# Verify equivalence
np.random.seed(42)
n = 100
X = np.random.randn(n, 5)
K = X @ X.T  # Linear kernel for demonstration

K_centered_1 = center_kernel_matrix(K)
K_centered_2 = center_kernel_matrix_alternative(K)

print(f"Max difference between methods: {np.abs(K_centered_1 - K_centered_2).max():.2e}")
# Output: Max difference between methods: 2.84e-14 (numerical precision)
```

When we solve the dual eigenvalue problem $\tilde{\mathbf{K}}\boldsymbol{\alpha} = \mu\boldsymbol{\alpha}$, we obtain eigenvectors $\boldsymbol{\alpha}$ that are normalized in the coefficient space (i.e., $|\boldsymbol{\alpha}| = 1$). However, this does not mean the corresponding principal component $\mathbf{v} = \sum_i \alpha_i \tilde{\phi}(\mathbf{x}_i)$ has unit norm in feature space.
For proper PCA, we need $|\mathbf{v}| = 1$ in $\mathcal{F}$.
Computing the Norm in Feature Space
$$|\mathbf{v}|^2 = \langle \mathbf{v}, \mathbf{v} \rangle = \left\langle \sum_i \alpha_i \tilde{\phi}(\mathbf{x}_i), \sum_j \alpha_j \tilde{\phi}(\mathbf{x}_j) \right\rangle$$
$$= \sum_i \sum_j \alpha_i \alpha_j \langle \tilde{\phi}(\mathbf{x}_i), \tilde{\phi}(\mathbf{x}_j) \rangle = \sum_i \sum_j \alpha_i \alpha_j \tilde{K}_{ij}$$
$$= \boldsymbol{\alpha}^T \tilde{\mathbf{K}} \boldsymbol{\alpha}$$
Since $\boldsymbol{\alpha}$ is an eigenvector with eigenvalue $\mu$: $$\boldsymbol{\alpha}^T \tilde{\mathbf{K}} \boldsymbol{\alpha} = \boldsymbol{\alpha}^T (\mu \boldsymbol{\alpha}) = \mu |\boldsymbol{\alpha}|^2$$
If $\boldsymbol{\alpha}$ is normalized to $|\boldsymbol{\alpha}| = 1$, then $|\mathbf{v}|^2 = \mu$.
To ensure $|\mathbf{v}| = 1$, we must normalize the eigenvector $\boldsymbol{\alpha}$ such that $\boldsymbol{\alpha}^T \tilde{\mathbf{K}} \boldsymbol{\alpha} = 1$, which means $\mu |\boldsymbol{\alpha}|^2 = 1$, or equivalently $|\boldsymbol{\alpha}| = 1/\sqrt{\mu}$.
Normalization Procedure
For each eigenvector $\boldsymbol{\alpha}^{(k)}$ with eigenvalue $\mu_k > 0$, rescale it so that $|\boldsymbol{\alpha}^{(k)}| = 1/\sqrt{\mu_k}$; equivalently, divide the unit-norm eigenvector returned by the eigensolver by $\sqrt{\mu_k}$.
After this normalization, projecting any point onto the principal component gives: $$z_k(\mathbf{x}) = \langle \mathbf{v}_k, \tilde{\phi}(\mathbf{x}) \rangle = \sum_i \alpha_i^{(k)} \langle \tilde{\phi}(\mathbf{x}_i), \tilde{\phi}(\mathbf{x}) \rangle$$
Why Zero Eigenvalues Must Be Discarded
If $\mu_k = 0$, then $\tilde{\mathbf{K}}\boldsymbol{\alpha}^{(k)} = \mathbf{0}$, which means: $$|\mathbf{v}_k|^2 = \boldsymbol{\alpha}^{(k)T}\tilde{\mathbf{K}}\boldsymbol{\alpha}^{(k)} = 0$$
The corresponding principal component has zero length in feature space—it's a trivial direction. These should be excluded from the final reduced representation.
```python
import numpy as np


def normalize_kpca_eigenvectors(eigenvalues: np.ndarray,
                                eigenvectors: np.ndarray,
                                tol: float = 1e-10) -> tuple:
    """
    Normalize KPCA eigenvectors for unit-length principal components in feature space.

    Parameters:
        eigenvalues: Array of eigenvalues from kernel matrix
        eigenvectors: Matrix where columns are eigenvectors (n x n)
        tol: Threshold below which eigenvalues are considered zero

    Returns:
        eigenvalues_valid: Non-zero eigenvalues
        alphas_normalized: Normalized eigenvectors (columns)
    """
    # Sort by eigenvalue (descending)
    idx = np.argsort(eigenvalues)[::-1]
    eigenvalues = eigenvalues[idx]
    eigenvectors = eigenvectors[:, idx]

    # Keep only positive eigenvalues
    valid = eigenvalues > tol
    eigenvalues_valid = eigenvalues[valid]
    eigenvectors_valid = eigenvectors[:, valid]

    # Normalize: divide each eigenvector by sqrt(eigenvalue)
    # This ensures ||v_k|| = 1 in feature space
    alphas_normalized = eigenvectors_valid / np.sqrt(eigenvalues_valid)

    return eigenvalues_valid, alphas_normalized


# Example usage
np.random.seed(42)
n = 50
K_centered = np.random.randn(n, n)
K_centered = K_centered @ K_centered.T  # Make symmetric PSD

eigenvalues, eigenvectors = np.linalg.eigh(K_centered)
eigenvalues_valid, alphas = normalize_kpca_eigenvectors(eigenvalues, eigenvectors)

print(f"Number of non-trivial components: {len(eigenvalues_valid)}")

# Verify normalization: alpha^T @ K @ alpha should equal 1
for k in range(min(3, len(eigenvalues_valid))):
    alpha_k = alphas[:, k]
    norm_squared = alpha_k @ K_centered @ alpha_k
    print(f"Component {k+1}: ||v||^2 = {norm_squared:.6f} (should be ≈ 1.0)")
```

A critical capability for any dimensionality reduction method is projecting new, unseen data points onto the learned reduced representation. For Kernel PCA, this requires careful handling of centering.
The Projection Formula
For a new point $\mathbf{x}^*$ (not in the training set), the projection onto the $k$-th principal component is: $$z_k(\mathbf{x}^*) = \langle \mathbf{v}_k, \tilde{\phi}(\mathbf{x}^*) \rangle$$
Expanding: $$z_k(\mathbf{x}^*) = \sum_{i=1}^{n} \alpha_i^{(k)} \langle \tilde{\phi}(\mathbf{x}_i), \tilde{\phi}(\mathbf{x}^*) \rangle$$
The inner product involves centered feature vectors, but we can only compute uncentered kernel values $k(\mathbf{x}_i, \mathbf{x}^*)$.
Centering for Test Points
We need to compute: $$\langle \tilde{\phi}(\mathbf{x}_i), \tilde{\phi}(\mathbf{x}^*) \rangle = \langle \phi(\mathbf{x}_i) - \bar{\phi}, \phi(\mathbf{x}^*) - \bar{\phi} \rangle$$
where $\bar{\phi} = \frac{1}{n}\sum_m \phi(\mathbf{x}_m)$ is the mean from the training set.
Expanding: $$= k(\mathbf{x}_i, \mathbf{x}^*) - \frac{1}{n}\sum_m k(\mathbf{x}_m, \mathbf{x}^*) - \frac{1}{n}\sum_m k(\mathbf{x}_i, \mathbf{x}_m) + \frac{1}{n^2}\sum_m \sum_\ell k(\mathbf{x}_m, \mathbf{x}_\ell)$$
Let $\mathbf{k}^* = [k(\mathbf{x}_1, \mathbf{x}^*), \ldots, k(\mathbf{x}_n, \mathbf{x}^*)]^T$ be the kernel vector for the test point, and let $\mathbf{1}$ denote the $n$-dimensional vector of ones. In matrix-vector form, the centered kernel vector is: $$\tilde{\mathbf{k}}^* = \mathbf{k}^* - \frac{1}{n}\mathbf{K}\mathbf{1} - \frac{1}{n}(\mathbf{1}^T \mathbf{k}^*)\mathbf{1} + \frac{1}{n^2}(\mathbf{1}^T\mathbf{K}\mathbf{1})\mathbf{1}$$
Simplified Formulation
Define: $\bar{\mathbf{k}} = \frac{1}{n}\mathbf{K}\mathbf{1}$ (the vector of row means of the training kernel matrix), $\bar{k}^* = \frac{1}{n}\mathbf{1}^T\mathbf{k}^*$ (the mean of the test kernel vector), and $\bar{K} = \frac{1}{n^2}\mathbf{1}^T\mathbf{K}\mathbf{1}$ (the grand mean of the training kernel matrix).
The centered kernel vector is: $$\tilde{\mathbf{k}}^* = \mathbf{k}^* - \bar{\mathbf{k}} - \bar{k}^* \mathbf{1} + \bar{K} \mathbf{1}$$
The projection is then: $$z_k(\mathbf{x}^*) = \sum_i \alpha_i^{(k)} \tilde{k}^*_i = (\boldsymbol{\alpha}^{(k)})^T \tilde{\mathbf{k}}^*$$
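As a quick consistency check (not stated in the original derivation, but immediate from the formulas above): if the "new" point happens to be a training point $\mathbf{x}_i$, then $\tilde{\mathbf{k}}^*$ is exactly the $i$-th column of $\tilde{\mathbf{K}}$, so $$z_k(\mathbf{x}_i) = (\tilde{\mathbf{K}}\boldsymbol{\alpha}^{(k)})_i = \mu_k\,\alpha_i^{(k)},$$ which coincides with the training-time projection. The test-time formula therefore extends the fitted model rather than redefining it.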
Complete Projection Procedure
To project a new point: (1) compute its kernel vector $\mathbf{k}^*$ against all $n$ training points; (2) center it using the stored training statistics to obtain $\tilde{\mathbf{k}}^*$; (3) compute $z_k(\mathbf{x}^*) = (\boldsymbol{\alpha}^{(k)})^T \tilde{\mathbf{k}}^*$ for each retained component $k$.
Note: Training statistics ($\bar{\mathbf{k}}$, $\bar{K}$) must be stored during training for use at test time.
```python
import numpy as np
from typing import Callable


class KernelPCA:
    """
    Complete Kernel PCA implementation with proper centering for test points.
    """

    def __init__(self, n_components: int, kernel: Callable):
        """
        Parameters:
            n_components: Number of principal components to extract
            kernel: Kernel function k(X, Y) returning matrix of kernel values
        """
        self.n_components = n_components
        self.kernel = kernel

        # Stored after fitting
        self.X_train = None
        self.alphas = None       # Normalized eigenvectors
        self.eigenvalues = None  # Corresponding eigenvalues
        self.K_train = None      # Training kernel matrix
        self.row_means = None    # For centering test points
        self.grand_mean = None   # For centering test points

    def fit(self, X: np.ndarray) -> 'KernelPCA':
        """Fit Kernel PCA to training data."""
        self.X_train = X

        # Compute training kernel matrix
        self.K_train = self.kernel(X, X)

        # Store centering statistics
        self.row_means = self.K_train.mean(axis=0)  # Column means (symmetric)
        self.grand_mean = self.K_train.mean()

        # Center the kernel matrix
        K_centered = (self.K_train
                      - self.row_means.reshape(-1, 1)
                      - self.row_means.reshape(1, -1)
                      + self.grand_mean)

        # Eigendecomposition
        eigenvalues, eigenvectors = np.linalg.eigh(K_centered)

        # Sort descending
        idx = np.argsort(eigenvalues)[::-1]
        eigenvalues = eigenvalues[idx]
        eigenvectors = eigenvectors[:, idx]

        # Keep top k positive eigenvalues
        valid = eigenvalues > 1e-10
        eigenvalues = eigenvalues[valid][:self.n_components]
        eigenvectors = eigenvectors[:, valid][:, :self.n_components]

        # Normalize for unit-length in feature space
        self.alphas = eigenvectors / np.sqrt(eigenvalues)
        self.eigenvalues = eigenvalues

        return self

    def transform(self, X: np.ndarray) -> np.ndarray:
        """Project new points onto kernel principal components."""
        # Kernel values between test and training points
        K_test = self.kernel(X, self.X_train)  # (n_test, n_train)

        # Center using training statistics
        test_means = K_test.mean(axis=1, keepdims=True)   # Mean over training points for each test point
        K_test_centered = (K_test
                           - self.row_means.reshape(1, -1)  # Training column means
                           - test_means                     # Test row means
                           + self.grand_mean)               # Training grand mean

        # Project onto principal components
        Z = K_test_centered @ self.alphas
        return Z

    def fit_transform(self, X: np.ndarray) -> np.ndarray:
        """Fit and transform in one step."""
        self.fit(X)
        return self.transform(X)


# Example: RBF kernel
def rbf_kernel(X, Y, gamma=0.1):
    """Compute RBF kernel matrix between X and Y."""
    X_sq = np.sum(X**2, axis=1, keepdims=True)
    Y_sq = np.sum(Y**2, axis=1, keepdims=True)
    dist_sq = X_sq + Y_sq.T - 2 * X @ Y.T
    return np.exp(-gamma * dist_sq)


# Demo with concentric circles
from sklearn.datasets import make_circles

X, y = make_circles(n_samples=200, noise=0.05, factor=0.3, random_state=42)

kpca = KernelPCA(n_components=2, kernel=lambda X, Y: rbf_kernel(X, Y, gamma=5))
Z = kpca.fit_transform(X)

print(f"Original shape: {X.shape}")
print(f"Transformed shape: {Z.shape}")
print(f"Eigenvalues: {kpca.eigenvalues[:5]}")
```

We now have all the pieces for a complete Kernel PCA implementation. To consolidate: (1) compute the kernel matrix $\mathbf{K}$; (2) center it with the double centering formula to obtain $\tilde{\mathbf{K}}$; (3) solve $\tilde{\mathbf{K}}\boldsymbol{\alpha} = \mu\boldsymbol{\alpha}$; (4) discard zero eigenvalues and rescale each $\boldsymbol{\alpha}^{(k)}$ by $1/\sqrt{\mu_k}$; (5) project training or test points using centered kernel vectors. The table below summarizes the cost of each step.
| Step | Time Complexity | Space Complexity | Notes |
|---|---|---|---|
| Kernel matrix computation | $O(n^2 d)$ | $O(n^2)$ | Depends on kernel; $d$ is input dimension |
| Centering | $O(n^2)$ | $O(n^2)$ | In-place possible |
| Full eigendecomposition | $O(n^3)$ | $O(n^2)$ | Standard algorithm |
| Partial eigendecomposition | $O(n^2 k)$ | $O(nk)$ | Using iterative methods for top $k$ |
| Projection (per test point) | $O(nd + nk)$ | $O(n)$ | $n$ kernel evals + $k$ dot products |
Kernel PCA's $O(n^3)$ training time and $O(n^2)$ memory make it challenging for large datasets. For $n > 10,000$, consider approximations: Nyström method (random subset of kernel matrix), Random Fourier Features (approximate RBF kernel), or Incremental KPCA.
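As one illustration of the approximation route (a minimal sketch assuming an RBF kernel $k(\mathbf{x},\mathbf{y}) = \exp(-\gamma\|\mathbf{x}-\mathbf{y}\|^2)$; the function name `rff_features` and the parameter choices are mine, not from the text), Random Fourier Features replace the implicit feature map with an explicit low-dimensional random one, after which ordinary linear PCA approximates Kernel PCA:

```python
import numpy as np

def rff_features(X: np.ndarray, n_features: int = 500, gamma: float = 0.1,
                 seed: int = 0) -> np.ndarray:
    """Random Fourier Features approximating the RBF kernel exp(-gamma * ||x - y||^2)."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    # Frequencies sampled from the kernel's spectral density N(0, 2*gamma*I)
    W = rng.normal(scale=np.sqrt(2 * gamma), size=(d, n_features))
    b = rng.uniform(0.0, 2 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

# z(x)^T z(y) ~= k(x, y), so standard PCA on these features approximates RBF Kernel PCA
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 3))
Z = rff_features(X, n_features=500, gamma=0.5)
Zc = Z - Z.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(Zc.T @ Zc / len(Zc))   # 500 x 500 instead of n x n
order = np.argsort(eigvals)[::-1]
X_kpca_approx = Zc @ eigvecs[:, order[:2]]               # approximate top-2 KPCA projection
print(X_kpca_approx.shape)
```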
We have derived Kernel PCA from first principles, showing exactly how the kernel trick enables PCA in potentially infinite-dimensional feature spaces.
You now have a complete mathematical understanding of the kernel trick for PCA.
What's Next:
The next page dives deeper into centering in feature space—a step that looks simple but contains important subtleties. We'll visualize what centering does geometrically, explore what happens when centering is omitted or done incorrectly, and develop diagnostic tools to verify correct implementation.