Kernel PCA elegantly projects data into a reduced representation by operating implicitly in a high-dimensional feature space. But what if we want to go backwards? Given a point $\mathbf{z}$ in the reduced kernel PCA space, can we find a corresponding point $\mathbf{x}$ in the original input space?
This is the pre-image problem, and it presents a fundamental challenge. Unlike linear PCA, where we can trivially reconstruct from projections using the transpose of the projection matrix, Kernel PCA's implicit feature mapping makes reconstruction far more difficult—in general, impossible in closed form.
The pre-image problem is not merely theoretical; it arises naturally in applications such as denoising, compression with on-demand reconstruction, visualizing what each component encodes, and novelty detection.
This page develops the theory of the pre-image problem, explains why exact solutions are typically impossible, and presents practical approximation methods.
By the end of this page, you will understand why the pre-image problem is fundamentally ill-posed for nonlinear kernels, how to formulate it as an optimization problem, what iterative methods exist for approximate solutions, and when pre-image estimation is reliable versus unreliable.
Let's first recall how reconstruction works in standard linear PCA, to see exactly what becomes difficult in the kernel case.
Linear PCA Projection
Given centered data $\tilde{\mathbf{x}} = \mathbf{x} - \bar{\mathbf{x}}$ and principal component matrix $\mathbf{V}_k = [\mathbf{v}_1, \ldots, \mathbf{v}_k]$, the projection is: $$\mathbf{z} = \mathbf{V}_k^T \tilde{\mathbf{x}} \in \mathbb{R}^k$$
Linear PCA Reconstruction
To reconstruct (approximately) from the reduced representation: $$\hat{\tilde{\mathbf{x}}} = \mathbf{V}_k \mathbf{z} = \mathbf{V}_k \mathbf{V}_k^T \tilde{\mathbf{x}}$$ $$\hat{\mathbf{x}} = \hat{\tilde{\mathbf{x}}} + \bar{\mathbf{x}}$$
This is exact if we use all $d$ components; with $k < d$ components, we get the best rank-$k$ approximation (in squared error).
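As a point of reference, here is a minimal NumPy sketch of this projection/reconstruction round trip (the data and variable names are illustrative, not taken from any particular library):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))          # toy data: 200 samples, d = 6

x_bar = X.mean(axis=0)
X_tilde = X - x_bar                    # centered data
_, _, Vt = np.linalg.svd(X_tilde, full_matrices=False)

k = 3
V_k = Vt[:k].T                         # (d, k): top-k principal directions

z = V_k.T @ X_tilde[0]                 # project one centered sample
x_hat = V_k @ z + x_bar                # reconstruct; exact only if k == d

print(f"Rank-{k} reconstruction error: {np.linalg.norm(X[0] - x_hat):.4f}")
```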
Why This Works
In linear PCA, the projection matrix $\mathbf{P} = \mathbf{V}_k\mathbf{V}_k^T$ is an orthogonal projector onto the principal subspace. Any point in the input space projects to this subspace, and points within the subspace are their own pre-images. The mapping from subspace to input space is the identity (embedded in higher dimensions).
What Changes with Kernels
In Kernel PCA, the situation is fundamentally different:
Eigenvectors are in feature space: $\mathbf{v}_k \in \mathcal{F}$, not $\mathbb{R}^d$. We can't represent them explicitly.
The mapping $\phi$ is nonlinear and not invertible: Even if we could compute a point in feature space, mapping it back to input space is a separate (hard) problem.
Feature space may be infinite-dimensional: The projection lives in feature space, which can have higher (or infinite) dimension than the input space.
No pseudo-inverse exists in the usual sense: The relationship between input space and feature space is fundamentally asymmetric.
The Core Difficulty
In Kernel PCA, we compute projections of the form: $$z_k = \langle \mathbf{v}_k, \phi(\mathbf{x}) \rangle = \sum_{i=1}^n \alpha_i^{(k)} k(\mathbf{x}_i, \mathbf{x})$$
Given $\mathbf{z} = (z_1, \ldots, z_k)$, we want $\mathbf{x}$. But this requires inverting both the kernel function and the projection—neither of which has a closed-form inverse.
Let's formalize what we're trying to achieve and why it's fundamentally difficult.
The Feature Space Reconstruction
In Kernel PCA, the projection of $\phi(\mathbf{x})$ onto the first $k$ principal components in feature space is: $$\mathbf{P}_k \tilde{\phi}(\mathbf{x}) = \sum_{j=1}^k \mathbf{v}_j \langle \mathbf{v}_j, \tilde{\phi}(\mathbf{x}) \rangle = \sum_{j=1}^k z_j \mathbf{v}_j$$
This is a point in $\mathcal{F}$ (the feature space). If we denote this reconstructed point in feature space as $\Phi_{rec}$, the pre-image problem asks:
Find $\mathbf{x}$ such that $\phi(\mathbf{x}) = \Phi_{rec}$ (or as close as possible)
Why Exact Pre-images May Not Exist
The reconstructed point $\Phi_{rec}$ lies in the subspace spanned by $\{\mathbf{v}_1, \ldots, \mathbf{v}_k\}$. But not every point in feature space corresponds to some input $\mathbf{x}$—only points in the image of $\phi$: $$\{\phi(\mathbf{x}) : \mathbf{x} \in \mathcal{X}\} \subset \mathcal{F}$$
This image is generally a curved manifold in feature space, not a linear subspace. The projection onto principal components typically does not land exactly on this manifold.
For most nonlinear kernels, there is NO input $\mathbf{x}$ such that $\phi(\mathbf{x})$ exactly equals the feature-space reconstruction $\Phi_{rec}$. The pre-image problem is inherently ill-posed—we must settle for approximations.
Approximate Pre-image Formulation
Since exact pre-images typically don't exist, we seek approximate solutions:
$$\hat{\mathbf{x}} = \arg\min_{\mathbf{x} \in \mathcal{X}} \| \phi(\mathbf{x}) - \Phi_{rec} \|_{\mathcal{F}}^2$$
This minimizes the squared distance in feature space between the image of $\mathbf{x}$ and the reconstruction target.
Expanding the Objective
$$\| \phi(\mathbf{x}) - \Phi_{rec} \|^2 = \| \phi(\mathbf{x}) \|^2 - 2 \langle \phi(\mathbf{x}), \Phi_{rec} \rangle + \| \Phi_{rec} \|^2$$
The third term is constant with respect to $\mathbf{x}$. Let's analyze the first two:
Term 1: $\|\phi(\mathbf{x})\|^2 = k(\mathbf{x}, \mathbf{x})$
For the RBF kernel, $k(\mathbf{x}, \mathbf{x}) = 1$ for all $\mathbf{x}$, so this is constant.
Term 2: $\langle \phi(\mathbf{x}), \Phi_{rec} \rangle = \sum_{j=1}^k z_j \langle \phi(\mathbf{x}), \mathbf{v}_j \rangle$
Using the dual representation $\mathbf{v}_j = \sum_i \alpha_i^{(j)} \tilde{\phi}(\mathbf{x}_i)$: $$= \sum_{j=1}^k z_j \sum_i \alpha_i^{(j)} \langle \phi(\mathbf{x}), \tilde{\phi}(\mathbf{x}_i) \rangle$$
This requires computing centered kernel values between $\mathbf{x}$ and all training points.
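Concretely, this means recentering each new kernel row with statistics from the training kernel matrix. A minimal sketch, assuming an RBF kernel and hypothetical variables `X_train` (training data), `K_train` (uncentered training kernel matrix), and `gamma`:

```python
import numpy as np

def centered_kernel_row(x, X_train, K_train, gamma):
    """Centered kernel values k_tilde(x, x_i) between a new point x and the training points."""
    k_row = np.exp(-gamma * np.sum((X_train - x)**2, axis=1))  # k(x, x_i)
    col_means = K_train.mean(axis=0)   # (1/n) sum_m k(x_m, x_i)
    row_mean = k_row.mean()            # (1/n) sum_m k(x, x_m)
    grand_mean = K_train.mean()
    return k_row - col_means - row_mean + grand_mean

# The cross term <phi(x), Phi_rec> is then betas @ centered_kernel_row(x, ...),
# with the beta weights derived from z and the alpha coefficients as in the
# fixed-point section below.
```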
The Optimization Landscape
For nonlinear kernels like the RBF, this objective is typically non-convex with potentially many local minima. Since the remaining terms are constant for the RBF kernel (where $k(\mathbf{x}, \mathbf{x}) = 1$), minimizing the distance reduces to: $$\hat{\mathbf{x}} = \arg\max_{\mathbf{x}} \langle \phi(\mathbf{x}), \Phi_{rec} \rangle$$
which has no closed-form solution and requires iterative methods.
One of the most widely used approaches to the pre-image problem is the fixed-point iteration method proposed by Mika et al. and refined by Schölkopf et al.
The Weighted Combination Idea
The key insight is that we can express the pre-image as a weighted combination of training points, where the weights depend on kernel similarities:
$$\hat{\mathbf{x}} = \frac{\sum_{i=1}^n \gamma_i \mathbf{x}_i}{\sum_{i=1}^n \gamma_i}$$
where $\gamma_i$ measures how similar $\phi(\mathbf{x})$ should be to $\phi(\mathbf{x}_i)$ given the target projection.
Derivation for RBF Kernel
For the RBF kernel $k(\mathbf{x}, \mathbf{y}) = \exp(-|\mathbf{x} - \mathbf{y}|^2 / 2\sigma^2)$, we can derive a fixed-point equation.
The feature-space reconstruction can be written as: $$\Phi_{rec} = \sum_{j=1}^k z_j \mathbf{v}_j = \sum_{j=1}^k z_j \sum_i \alpha_i^{(j)} \tilde{\phi}(\mathbf{x}_i)$$
Define weights: $$\beta_i = \sum_{j=1}^k z_j \alpha_i^{(j)}$$
Then $\Phi_{rec} = \sum_i \beta_i \tilde{\phi}(\mathbf{x}_i)$ (approximately, ignoring centering for now).
For the RBF kernel, maximizing $\langle \phi(\mathbf{x}), \Phi_{rec} \rangle$ leads to the fixed-point iteration:
$$\mathbf{x}^{(t+1)} = \frac{\sum_{i=1}^n \gamma_i^{(t)} \mathbf{x}_i}{\sum_{i=1}^n \gamma_i^{(t)}}$$
where $\gamma_i^{(t)} = \beta_i \cdot k(\mathbf{x}^{(t)}, \mathbf{x}_i)$ and $\beta_i = \sum_j z_j \alpha_i^{(j)}$.
Algorithm: Fixed-Point Pre-Image Estimation
Input: Projection $\mathbf{z} = (z_1, \ldots, z_k)$, eigenvectors $\{\boldsymbol{\alpha}^{(j)}\}$, training data $\{\mathbf{x}_i\}$
Compute combination weights: $\beta_i = \sum_{j=1}^k z_j \alpha_i^{(j)}$
Initialize: $\mathbf{x}^{(0)} = $ centroid of training data (or random point)
Iterate until convergence: compute $\gamma_i^{(t)} = \beta_i\, k(\mathbf{x}^{(t)}, \mathbf{x}_i)$ for all $i$, then update $\mathbf{x}^{(t+1)} = \sum_i \gamma_i^{(t)} \mathbf{x}_i \big/ \sum_i \gamma_i^{(t)}$
Output: Approximate pre-image $\hat{\mathbf{x}} = \mathbf{x}^{(T)}$
Convergence Properties
The iteration often converges in a modest number of steps, but convergence is not guaranteed: the update can oscillate or become numerically unstable when the weights $\gamma_i^{(t)}$ are small or nearly cancel, and the result depends on the initialization. Multiple restarts, as used in the implementation below, are a common safeguard.
```python
import numpy as np

def rbf_kernel(x: np.ndarray, Y: np.ndarray, gamma: float) -> np.ndarray:
    """Compute RBF kernel between point x and matrix Y."""
    diff = Y - x.reshape(1, -1)
    return np.exp(-gamma * np.sum(diff**2, axis=1))

def fixed_point_preimage(
        z: np.ndarray,        # Target projection (k,)
        alphas: np.ndarray,   # KPCA eigenvectors (n, k)
        X_train: np.ndarray,  # Training data (n, d)
        gamma: float,         # RBF kernel parameter
        max_iter: int = 100,
        tol: float = 1e-6,
        n_restarts: int = 5) -> tuple:
    """
    Estimate pre-image using fixed-point iteration.

    Parameters:
        z: Target projection in KPCA space
        alphas: Normalized eigenvectors from KPCA (columns)
        X_train: Training data
        gamma: RBF kernel gamma parameter
        max_iter: Maximum iterations per restart
        tol: Convergence tolerance
        n_restarts: Number of random restarts

    Returns:
        x_preimage: Estimated pre-image point
        best_obj: Best objective value achieved
    """
    n, d = X_train.shape

    # Compute combination weights: beta_i = sum_j z_j * alpha_i^(j)
    betas = alphas @ z  # Shape: (n,)

    best_x = None
    best_obj = -np.inf

    for restart in range(n_restarts):
        # Initialize with centroid (first restart) or a random training point
        if restart == 0:
            x = X_train.mean(axis=0)  # Centroid
        else:
            idx = np.random.choice(n)
            x = X_train[idx].copy()

        for iteration in range(max_iter):
            x_old = x.copy()

            # Compute kernel weights
            k_vals = rbf_kernel(x, X_train, gamma)
            gamma_weights = betas * k_vals

            # Handle all-zero weights
            if gamma_weights.sum() == 0:
                break

            # Normalize weights (use absolute values to handle sign issues)
            abs_gamma = np.abs(gamma_weights)
            if abs_gamma.sum() > 0:
                x = (abs_gamma[:, None] * X_train).sum(axis=0) / abs_gamma.sum()

            # Check convergence
            if np.linalg.norm(x - x_old) < tol:
                break

        # Evaluate objective: correlation with target in feature space
        k_final = rbf_kernel(x, X_train, gamma)
        obj = np.sum(betas * k_final)

        if obj > best_obj:
            best_obj = obj
            best_x = x.copy()

    return best_x, best_obj

# Example usage
np.random.seed(42)
n = 100
d = 5
X_train = np.random.randn(n, d)

# Fit KPCA (simplified)
gamma = 0.5
K = np.exp(-gamma * np.sum((X_train[:, None, :] - X_train[None, :, :])**2, axis=2))

# Center kernel
row_means = K.mean(axis=1, keepdims=True)
col_means = K.mean(axis=0, keepdims=True)
grand_mean = K.mean()
K_centered = K - row_means - col_means + grand_mean

# Eigendecomposition
eigenvalues, eigenvectors = np.linalg.eigh(K_centered)
idx = np.argsort(eigenvalues)[::-1]
n_components = 3
eigenvalues = eigenvalues[idx][:n_components]
eigenvectors = eigenvectors[:, idx][:, :n_components]
alphas = eigenvectors / np.sqrt(np.maximum(eigenvalues, 1e-10))

# Project a training point
test_idx = 42
z = K_centered[test_idx] @ alphas

# Find pre-image
x_preimage, obj = fixed_point_preimage(z, alphas, X_train, gamma)

print(f"Original point: {X_train[test_idx][:3]}...")
print(f"Pre-image estimate: {x_preimage[:3]}...")
print(f"Reconstruction error: {np.linalg.norm(X_train[test_idx] - x_preimage):.4f}")
```

An alternative to fixed-point iteration is to directly minimize the feature-space reconstruction error using gradient descent.
Objective Function
Define the reconstruction error (for the RBF kernel, where $\|\phi(\mathbf{x})\|^2 = 1$):
$$L(\mathbf{x}) = \| \phi(\mathbf{x}) - \Phi_{rec} \|^2 = 1 - 2\langle \phi(\mathbf{x}), \Phi_{rec} \rangle + \| \Phi_{rec} \|^2$$
Since we want to minimize this, and the first and third terms are constant (for RBF), we maximize: $$J(\mathbf{x}) = \langle \phi(\mathbf{x}), \Phi_{rec} \rangle = \sum_i \beta_i \tilde{k}(\mathbf{x}, \mathbf{x}_i)$$
where $\tilde{k}$ denotes the centered kernel evaluation.
Computing the Gradient
For the RBF kernel: $$k(\mathbf{x}, \mathbf{x}_i) = \exp(-\gamma |\mathbf{x} - \mathbf{x}_i|^2)$$
The gradient with respect to $\mathbf{x}$ is: $$\frac{\partial k}{\partial \mathbf{x}} = -2\gamma (\mathbf{x} - \mathbf{x}_i) k(\mathbf{x}, \mathbf{x}_i)$$
For the full objective, ignoring the centering terms for now: $$\nabla_\mathbf{x} J = -2\gamma \sum_i \beta_i\, k(\mathbf{x}, \mathbf{x}_i)(\mathbf{x} - \mathbf{x}_i)$$
The additional terms introduced by centering are addressed below.
This can be used in gradient ascent to find local maxima of $J(\mathbf{x})$.
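Before using this gradient in an optimizer, it is worth validating it numerically. The sketch below compares the analytic expression (without centering, as above) against central finite differences; `betas`, `X_train`, and `gamma` here are randomly generated stand-ins for the quantities defined earlier:

```python
import numpy as np

def J(x, betas, X_train, gamma):
    """Uncentered objective: sum_i beta_i * k(x, x_i) for the RBF kernel."""
    k_vals = np.exp(-gamma * np.sum((X_train - x)**2, axis=1))
    return betas @ k_vals

def grad_J(x, betas, X_train, gamma):
    """Analytic gradient: -2*gamma * sum_i beta_i * k(x, x_i) * (x - x_i)."""
    diff = x - X_train
    k_vals = np.exp(-gamma * np.sum(diff**2, axis=1))
    return -2 * gamma * (betas * k_vals) @ diff

# Finite-difference comparison at a random point (illustrative sizes)
rng = np.random.default_rng(1)
X_train = rng.normal(size=(50, 4))
betas = rng.normal(size=50)
gamma, eps = 0.5, 1e-6
x0 = rng.normal(size=4)

num_grad = np.array([
    (J(x0 + eps * e, betas, X_train, gamma)
     - J(x0 - eps * e, betas, X_train, gamma)) / (2 * eps)
    for e in np.eye(4)
])
# Maximum deviation between numerical and analytic gradient; should be close to zero
print(np.max(np.abs(num_grad - grad_J(x0, betas, X_train, gamma))))
```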
Gradient Ascent Algorithm
Initialize: $\mathbf{x}^{(0)}$ (e.g., training centroid or weighted average)
Iterate: compute the gradient $\nabla_\mathbf{x} J(\mathbf{x}^{(t)})$ and update $\mathbf{x}^{(t+1)} = \mathbf{x}^{(t)} + \eta\, \nabla_\mathbf{x} J(\mathbf{x}^{(t)})$ with step size $\eta$
Stop when gradient magnitude is below threshold or max iterations reached
Advantages Over Fixed-Point Iteration
Gradient ascent applies to any differentiable kernel, not just the RBF kernel, and it can use standard optimization machinery such as line search, momentum, or adaptive step sizes.
Disadvantages
It requires tuning a step size, usually needs more iterations than the fixed-point update, and still converges only to a local optimum.
Handling Centering in the Gradient
The proper objective uses centered kernel values. Let: $$\tilde{k}(\mathbf{x}, \mathbf{x}_i) = k(\mathbf{x}, \mathbf{x}_i) - \frac{1}{n}\sum_m k(\mathbf{x}_m, \mathbf{x}_i) - \frac{1}{n}\sum_m k(\mathbf{x}, \mathbf{x}_m) + \bar{K}$$
The third term depends on $\mathbf{x}$, adding to the gradient: $$\frac{\partial}{\partial \mathbf{x}}\left[-\frac{1}{n}\sum_m k(\mathbf{x}, \mathbf{x}_m)\right] = \frac{2\gamma}{n}\sum_m k(\mathbf{x}, \mathbf{x}_m)(\mathbf{x} - \mathbf{x}_m)$$
This must be included for correct centering-aware optimization.
```python
import numpy as np

def gradient_preimage(
        z: np.ndarray,        # Target projection (k,)
        alphas: np.ndarray,   # KPCA eigenvectors (n, k)
        X_train: np.ndarray,  # Training data (n, d)
        K_train: np.ndarray,  # Training kernel matrix (n, n)
        gamma: float,         # RBF kernel parameter
        lr: float = 0.1,
        max_iter: int = 500,
        tol: float = 1e-6,
        n_restarts: int = 3) -> tuple:
    """
    Estimate pre-image using gradient ascent.

    Properly handles centering in the objective and gradient.
    """
    n, d = X_train.shape

    # Compute combination weights
    betas = alphas @ z  # Shape: (n,)

    # Pre-compute training statistics for centering
    K_col_means = K_train.mean(axis=0)  # (n,)
    K_grand_mean = K_train.mean()

    def rbf_and_grad(x):
        """Compute RBF kernel values and gradients."""
        diff = x - X_train                  # (n, d)
        sq_dist = np.sum(diff**2, axis=1)   # (n,)
        k_vals = np.exp(-gamma * sq_dist)   # (n,)
        # Gradient: dk/dx = -2*gamma*(x - x_i)*k
        dk = -2 * gamma * diff * k_vals[:, None]  # (n, d)
        return k_vals, dk

    def objective_and_gradient(x):
        """
        Compute centered objective and gradient.

        Centered kernel: k_tilde(x, x_i) = k(x, x_i) - row/column means + grand mean
        """
        k_vals, dk = rbf_and_grad(x)

        # Centering for test point
        test_row_mean = k_vals.mean()  # Mean of k(x, x_i) over i
        k_centered = k_vals - K_col_means - test_row_mean + K_grand_mean

        # Objective: sum_i beta_i * k_tilde(x, x_i)
        obj = np.dot(betas, k_centered)

        # Gradient of centered kernel:
        # d(k_centered_i)/dx = dk_i/dx - (1/n)*sum_j dk_j/dx
        dk_mean = dk.mean(axis=0)
        dk_centered = dk - dk_mean  # (n, d)

        # Gradient of objective
        grad = np.dot(betas, dk_centered)  # (d,)

        return obj, grad

    best_x = None
    best_obj = -np.inf

    for restart in range(n_restarts):
        # Initialize
        if restart == 0:
            x = X_train.mean(axis=0)
        else:
            x = X_train[np.random.choice(n)].copy()

        for iteration in range(max_iter):
            obj, grad = objective_and_gradient(x)

            # Gradient ascent step
            x_new = x + lr * grad

            # Check convergence
            if np.linalg.norm(x_new - x) < tol:
                break

            x = x_new

        final_obj, _ = objective_and_gradient(x)
        if final_obj > best_obj:
            best_obj = final_obj
            best_x = x.copy()

    return best_x, best_obj

# Example comparison with fixed-point method
np.random.seed(42)
n, d = 100, 5
X_train = np.random.randn(n, d)
gamma = 0.5

# Compute kernel and KPCA
K = np.exp(-gamma * np.sum((X_train[:, None, :] - X_train[None, :, :])**2, axis=2))
row_means = K.mean(axis=1, keepdims=True)
col_means = K.mean(axis=0, keepdims=True)
grand_mean = K.mean()
K_centered = K - row_means - col_means + grand_mean

eigenvalues, eigenvectors = np.linalg.eigh(K_centered)
idx = np.argsort(eigenvalues)[::-1]
n_components = 3
eigenvalues = eigenvalues[idx][:n_components]
eigenvectors = eigenvectors[:, idx][:, :n_components]
alphas = eigenvectors / np.sqrt(np.maximum(eigenvalues, 1e-10))

# Project and reconstruct
test_idx = 25
z = K_centered[test_idx] @ alphas

x_gradient, obj_grad = gradient_preimage(z, alphas, X_train, K, gamma)

print(f"Original: {X_train[test_idx][:3]}...")
print(f"Gradient reconstruction: {x_gradient[:3]}...")
print(f"Gradient error: {np.linalg.norm(X_train[test_idx] - x_gradient):.4f}")
```

Pre-image estimation is inherently imperfect. Understanding its limitations is crucial for practical applications.
When Pre-images Work Well
High variance explained: When the top $k$ components capture most of the variance, the reconstruction target $\Phi_{rec}$ is close to the original $\phi(\mathbf{x})$, making good pre-images possible.
Smooth kernels: RBF kernels with appropriate bandwidth produce smooth optimization landscapes with fewer local minima.
Dense training data: When training points densely cover the input domain, pre-images can be well-approximated as weighted combinations.
Points near training data: Pre-images for projections of points near the training manifold are more reliable than for extrapolated points.
When Pre-images Fail
Low variance explained: When many components are discarded, the reconstruction target may be far from any valid feature-space image, and pre-images become unreliable.
Extrapolation: For projections corresponding to points outside the training domain, pre-image estimation often fails badly.
Local minima: Non-convex optimization can converge to poor local solutions.
Ill-conditioning: Certain kernel parameters (very small or large $\gamma$) lead to poorly-conditioned optimization.
For "synthetic" projections—points in KPCA space that don't correspond to any actual input (e.g., for generative applications)—pre-image quality is fundamentally limited. The reconstruction target may not lie near the manifold of valid feature-space images, making any pre-image a poor approximation.
Evaluating Pre-image Quality
Several metrics can assess pre-image quality:
1. Reprojection Error
Project the estimated pre-image back to KPCA space and compare: $$\text{Error} = \| \mathbf{z} - \mathbf{z}_{\text{reconstructed}} \|$$
Small reprojection error is necessary but not sufficient for good pre-images.
2. Feature-Space Distance
Compute (approximately) the distance in feature space: $$\| \phi(\hat{\mathbf{x}}) - \Phi_{rec} \|^2 \approx 1 - 2\langle \phi(\hat{\mathbf{x}}), \Phi_{rec} \rangle + \| \Phi_{rec} \|^2$$
3. Input-Space Error (for known pre-images)
When the original input is known, measure: $$\| \hat{\mathbf{x}} - \mathbf{x}_{\text{original}} \|$$
This is the most meaningful metric but requires ground truth.
4. Consistency Check
For the same projection, run pre-image estimation with different initializations. Large variance indicates unreliable estimates.
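As a sketch of the first two checks, reusing the variables from the fixed-point example above (`X_train`, `K`, `alphas`, `gamma`, `z`, and the estimate `x_preimage`) and the same centering convention:

```python
import numpy as np

# 1. Reprojection error: map the estimated pre-image back into KPCA space.
k_row = np.exp(-gamma * np.sum((X_train - x_preimage)**2, axis=1))
k_cent = k_row - K.mean(axis=0) - k_row.mean() + K.mean()
z_reproj = k_cent @ alphas
print(f"Reprojection error: {np.linalg.norm(z - z_reproj):.4f}")

# 2. Approximate feature-space distance for the RBF kernel:
#    ||phi(x)||^2 = 1 and ||Phi_rec||^2 = sum_j z_j^2 (orthonormal components).
betas = alphas @ z
cross = betas @ k_cent                       # approximates <phi(x_hat), Phi_rec>
feat_dist_sq = 1.0 - 2.0 * cross + np.sum(z**2)
print(f"Feature-space distance^2 (approximate): {feat_dist_sq:.4f}")
```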
| Method | Strengths | Weaknesses | When to Use |
|---|---|---|---|
| Fixed-point iteration | Fast, no tuning needed | May not converge, limited to specific kernels | RBF kernel, fast approximation needed |
| Gradient ascent | General, flexible, can use advanced optimizers | Requires step size tuning, slower | Non-RBF kernels, need more accuracy |
| Multiple restarts | Avoids some local minima | Expensive, no guarantee of global optimum | Critical applications, time available |
| Constrained optimization | Enforces validity constraints | More complex, requires constraint definition | Bounded input domains |
Despite its limitations, pre-image estimation enables important applications of Kernel PCA.
Denoising
The most established application is denoising. The procedure: map the (possibly noisy) input to its KPCA projection, keep only the leading $k$ components, and estimate a pre-image of the truncated feature-space reconstruction.
The projection "filters out" high-frequency noise that doesn't align with the principal components, and the pre-image provides a cleaned approximation in input space.
Image Denoising Example: in the classic experiments of Mika et al. on USPS handwritten digits, pre-images of projections onto the leading kernel principal components removed additive noise more effectively than linear PCA reconstructions using the same number of components.
Compression and Reconstruction
KPCA can compress data by storing only the low-dimensional projections: each sample is reduced to its $k$-dimensional code $\mathbf{z}$, and an approximate input is recovered on demand via pre-image estimation (the training data and eigenvector coefficients are stored once as part of the model).
This is useful when storage is limited and approximate reconstruction is acceptable.
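A rough sketch of this round trip, reusing `X_train`, `K`, `alphas`, `gamma`, and `fixed_point_preimage` from the earlier example (the "new" points here are synthetic stand-ins):

```python
import numpy as np

# Compress new points: store only their k-dimensional KPCA codes.
# The KPCA model itself (X_train, alphas, gamma) is stored once, not per point.
X_new = X_train[:10] + 0.05 * np.random.randn(10, X_train.shape[1])
K_new = np.exp(-gamma * np.sum((X_new[:, None] - X_train[None, :])**2, axis=2))
K_new_cent = K_new - K.mean(axis=0) - K_new.mean(axis=1, keepdims=True) + K.mean()
Z_codes = K_new_cent @ alphas                          # (10, k) stored codes

# Decompress on demand via pre-image estimation.
x_rec, _ = fixed_point_preimage(Z_codes[0], alphas, X_train, gamma)
print(f"Stored {Z_codes.shape[1]} of {X_new.shape[1]} values per point; "
      f"reconstruction error: {np.linalg.norm(X_new[0] - x_rec):.4f}")
```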
For denoising, choosing the right number of components $k$ is crucial. Too few components lose signal; too many retain noise. Cross-validation on a held-out set with known noise levels can help select $k$.
Feature Extraction and Visualization
Understanding what input features drive KPCA components: sweep one component $z_j$ over a range of values while holding the others at zero, estimate a pre-image for each synthetic projection, and inspect how the reconstructed inputs change.
This reveals what input-space variations correspond to each kernel principal component.
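A minimal sketch of such a sweep, reusing `fixed_point_preimage`, `alphas`, `X_train`, and `gamma` from the earlier example (the sweep range is illustrative, and per the warning above, pre-images of synthetic projections should be read qualitatively):

```python
import numpy as np

# Sweep the first kernel principal component while holding the others at zero,
# and inspect how the estimated pre-images move through input space.
component = 0
for value in np.linspace(-1.0, 1.0, 5):
    z_synth = np.zeros(alphas.shape[1])
    z_synth[component] = value
    x_hat, _ = fixed_point_preimage(z_synth, alphas, X_train, gamma)
    print(f"z_{component + 1} = {value:+.2f} -> "
          f"pre-image (first 3 dims): {np.round(x_hat[:3], 3)}")
```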
Novelty Detection
Pre-image distance can indicate anomalies: project a new point into KPCA space, reconstruct it from the leading components, and flag it as novel when the distance between the point and its estimated pre-image (or the feature-space reconstruction error) is unusually large.
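A sketch of this idea, again reusing `fixed_point_preimage`, `X_train`, `K`, `alphas`, and `gamma` from the earlier example (the shifted test point, and any threshold you might apply, are illustrative):

```python
import numpy as np

def novelty_score(x_new, X_train, K, alphas, gamma):
    """Input-space distance between a point and the pre-image of its KPCA projection."""
    k_row = np.exp(-gamma * np.sum((X_train - x_new)**2, axis=1))
    k_cent = k_row - K.mean(axis=0) - k_row.mean() + K.mean()
    z_new = k_cent @ alphas                                         # project
    x_hat, _ = fixed_point_preimage(z_new, alphas, X_train, gamma)  # reconstruct
    return np.linalg.norm(x_new - x_hat)

# A point far from the training distribution should score noticeably higher.
in_dist = novelty_score(X_train[0], X_train, K, alphas, gamma)
shifted = novelty_score(X_train[0] + 5.0, X_train, K, alphas, gamma)
print(f"In-distribution score: {in_dist:.3f}, shifted-point score: {shifted:.3f}")
```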
Limitations in Generative Applications
Attempts to use KPCA generatively (sampling in the latent space and finding pre-images) are problematic: sampled projections rarely lie near the image of $\phi$, so their pre-images are unreliable; KPCA defines no probability density over the latent space to sample from; and quality degrades quickly for projections far from the training data.
KPCA is fundamentally a discriminative/descriptive tool, not a generative model.
```python
import numpy as np

def kpca_denoise(
        X_noisy: np.ndarray,
        X_train: np.ndarray,
        K_train: np.ndarray,
        K_test: np.ndarray,       # Kernel between noisy points and training
        alphas: np.ndarray,       # Normalized KPCA eigenvectors
        gamma: float,
        centering_stats: tuple    # (col_means, grand_mean) from training
) -> np.ndarray:
    """
    Denoise data by projecting to KPCA space and estimating pre-images.

    Parameters:
        X_noisy: Noisy input data to denoise (m, d)
        X_train: Training data for KPCA (n, d)
        K_train: Training kernel matrix (n, n)
        K_test: Test-train kernel matrix (m, n)
        alphas: KPCA eigenvectors (n, k)
        gamma: RBF kernel gamma
        centering_stats: Tuple of (column_means, grand_mean) from training

    Returns:
        X_denoised: Denoised data (m, d)
    """
    col_means, grand_mean = centering_stats
    m, d = X_noisy.shape
    n = X_train.shape[0]

    # Center test kernel
    test_row_means = K_test.mean(axis=1, keepdims=True)
    K_test_centered = K_test - col_means.reshape(1, -1) - test_row_means + grand_mean

    # Project noisy data to KPCA space
    Z = K_test_centered @ alphas  # (m, k)

    X_denoised = np.zeros_like(X_noisy)

    for i in range(m):
        z = Z[i]
        betas = alphas @ z  # Combination weights

        # Simple weighted combination as initial estimate
        # (fast approximation instead of full optimization)
        k_vals = np.exp(-gamma * np.sum((X_noisy[i:i+1] - X_train)**2, axis=1))
        weights = np.abs(betas) * k_vals

        if weights.sum() > 0:
            X_denoised[i] = (weights[:, None] * X_train).sum(axis=0) / weights.sum()
        else:
            X_denoised[i] = X_noisy[i]  # Fallback

    return X_denoised

# Example: denoising 1D function values
np.random.seed(42)

# Training data: clean sinusoid samples
n_train = 100
t_train = np.linspace(0, 4*np.pi, n_train)
X_train = np.column_stack([t_train, np.sin(t_train)])

# Noisy test data
n_test = 50
t_test = np.linspace(0, 4*np.pi, n_test)
noise = np.random.randn(n_test) * 0.3
X_noisy = np.column_stack([t_test, np.sin(t_test) + noise])
X_clean = np.column_stack([t_test, np.sin(t_test)])

# Compute kernels
gamma = 1.0
K_train = np.exp(-gamma * np.sum((X_train[:, None] - X_train[None, :])**2, axis=2))
K_test = np.exp(-gamma * np.sum((X_noisy[:, None] - X_train[None, :])**2, axis=2))

# Center training kernel and fit KPCA
col_means = K_train.mean(axis=0)
grand_mean = K_train.mean()
K_train_cent = K_train - col_means - col_means[:, None] + grand_mean

eigenvalues, eigenvectors = np.linalg.eigh(K_train_cent)
idx = np.argsort(eigenvalues)[::-1]
k = 5  # Keep 5 components
eigenvalues = eigenvalues[idx][:k]
eigenvectors = eigenvectors[:, idx][:, :k]
alphas = eigenvectors / np.sqrt(np.maximum(eigenvalues, 1e-10))

# Denoise
X_denoised = kpca_denoise(X_noisy, X_train, K_train, K_test, alphas, gamma,
                          (col_means, grand_mean))

# Evaluate
noise_error = np.mean((X_noisy[:, 1] - X_clean[:, 1])**2)
denoised_error = np.mean((X_denoised[:, 1] - X_clean[:, 1])**2)
print(f"Original MSE (noisy vs clean): {noise_error:.4f}")
print(f"Denoised MSE: {denoised_error:.4f}")
print(f"Noise reduction: {(1 - denoised_error/noise_error)*100:.1f}%")
```

The pre-image problem is a fundamental challenge in Kernel PCA that arises when we need to map from the reduced representation back to input space. Unlike linear PCA, there is no closed-form solution.
You now understand the pre-image problem: why it is fundamentally difficult, how iterative methods provide approximate solutions, and which applications benefit from pre-image estimation.
What's Next:
The final page addresses kernel selection—arguably the most consequential practical decision in Kernel PCA. We'll explore the properties of common kernels, methods for selecting kernel parameters, and strategies for evaluating whether a particular kernel is appropriate for your data.