Kernel PCA elegantly projects data into a reduced representation by operating implicitly in a high-dimensional feature space. But what if we want to go backwards? Given a point $\mathbf{z}$ in the reduced kernel PCA space, can we find a corresponding point $\mathbf{x}$ in the original input space?
This is the pre-image problem, and it presents a fundamental challenge. Unlike linear PCA, where we can trivially reconstruct from projections using the transpose of the projection matrix, Kernel PCA's implicit feature mapping makes reconstruction far more difficult—in general, impossible in closed form.
The pre-image problem is not merely theoretical; it arises naturally in applications such as denoising, compression with on-demand reconstruction, visualizing what each component encodes, and novelty detection.
This page develops the theory of the pre-image problem, explains why exact solutions are typically impossible, and presents practical approximation methods.
By the end of this page, you will understand why the pre-image problem is fundamentally ill-posed for nonlinear kernels, how to formulate it as an optimization problem, what iterative methods exist for approximate solutions, and when pre-image estimation is reliable versus unreliable.
Let's first recall how reconstruction works in standard linear PCA, to see exactly what becomes difficult in the kernel case.
Linear PCA Projection
Given centered data $\tilde{\mathbf{x}} = \mathbf{x} - \bar{\mathbf{x}}$ and principal component matrix $\mathbf{V}_k = [\mathbf{v}_1, \ldots, \mathbf{v}_k]$, the projection is: $$\mathbf{z} = \mathbf{V}_k^T \tilde{\mathbf{x}} \in \mathbb{R}^k$$
Linear PCA Reconstruction
To reconstruct (approximately) from the reduced representation: $$\hat{\tilde{\mathbf{x}}} = \mathbf{V}_k \mathbf{z} = \mathbf{V}_k \mathbf{V}_k^T \tilde{\mathbf{x}}$$ $$\hat{\mathbf{x}} = \hat{\tilde{\mathbf{x}}} + \bar{\mathbf{x}}$$
This is exact if we use all $d$ components; with $k < d$ components, we get the best rank-$k$ approximation (in squared error).
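As a point of reference, here is a minimal NumPy sketch of this projection/reconstruction round trip (the data and variable names are illustrative, not taken from any particular library):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))          # toy data: 200 samples, d = 6

x_bar = X.mean(axis=0)
X_tilde = X - x_bar                    # centered data
_, _, Vt = np.linalg.svd(X_tilde, full_matrices=False)

k = 3
V_k = Vt[:k].T                         # (d, k): top-k principal directions

z = V_k.T @ X_tilde[0]                 # project one centered sample
x_hat = V_k @ z + x_bar                # reconstruct; exact only if k == d

print(f"Rank-{k} reconstruction error: {np.linalg.norm(X[0] - x_hat):.4f}")
```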
Why This Works
In linear PCA, the projection matrix $\mathbf{P} = \mathbf{V}_k\mathbf{V}_k^T$ is an orthogonal projector onto the principal subspace. Any point in the input space projects to this subspace, and points within the subspace are their own pre-images. The mapping from subspace to input space is the identity (embedded in higher dimensions).
What Changes with Kernels
In Kernel PCA, the situation is fundamentally different:
Eigenvectors are in feature space: $\mathbf{v}_k \in \mathcal{F}$, not $\mathbb{R}^d$. We can't represent them explicitly.
The mapping $\phi$ is nonlinear and not invertible: Even if we could compute a point in feature space, mapping it back to input space is a separate (hard) problem.
Feature space may be infinite-dimensional: The projection lives in feature space, which can have higher (or infinite) dimension than the input space.
No pseudo-inverse exists in the usual sense: The relationship between input space and feature space is fundamentally asymmetric.
The Core Difficulty
In Kernel PCA, we compute projections of the form: $$z_k = \langle \mathbf{v}_k, \phi(\mathbf{x}) \rangle = \sum_{i=1}^n \alpha_i^{(k)} k(\mathbf{x}_i, \mathbf{x})$$
Given $\mathbf{z} = (z_1, \ldots, z_k)$, we want $\mathbf{x}$. But this requires inverting both the kernel function and the projection—neither of which has a closed-form inverse.
Let's formalize what we're trying to achieve and why it's fundamentally difficult.
The Feature Space Reconstruction
In Kernel PCA, the projection of $\phi(\mathbf{x})$ onto the first $k$ principal components in feature space is: $$\mathbf{P}_k \tilde{\phi}(\mathbf{x}) = \sum_{j=1}^k \mathbf{v}_j \langle \mathbf{v}_j, \tilde{\phi}(\mathbf{x}) \rangle = \sum_{j=1}^k z_j \mathbf{v}_j$$
This is a point in $\mathcal{F}$ (the feature space). If we denote this reconstructed point in feature space as $\Phi_{rec}$, the pre-image problem asks:
Find $\mathbf{x}$ such that $\phi(\mathbf{x}) = \Phi_{rec}$ (or as close as possible)
Why Exact Pre-images May Not Exist
The reconstructed point $\Phi_{rec}$ lies in the subspace spanned by $\{\mathbf{v}_1, \ldots, \mathbf{v}_k\}$. But not every point in feature space corresponds to some input $\mathbf{x}$—only points in the image of $\phi$: $$\{\phi(\mathbf{x}) : \mathbf{x} \in \mathcal{X}\} \subset \mathcal{F}$$
This image is generally a curved manifold in feature space, not a linear subspace. The projection onto principal components typically does not land exactly on this manifold.
For most nonlinear kernels, there is NO input $\mathbf{x}$ such that $\phi(\mathbf{x})$ exactly equals the feature-space reconstruction $\Phi_{rec}$. The pre-image problem is inherently ill-posed—we must settle for approximations.
Approximate Pre-image Formulation
Since exact pre-images typically don't exist, we seek approximate solutions:
$$\hat{\mathbf{x}} = \arg\min_{\mathbf{x} \in \mathcal{X}} \| \phi(\mathbf{x}) - \Phi_{rec} \|_{\mathcal{F}}^2$$
This minimizes the squared distance in feature space between the image of $\mathbf{x}$ and the reconstruction target.
Expanding the Objective
$$\| \phi(\mathbf{x}) - \Phi_{rec} \|^2 = \| \phi(\mathbf{x}) \|^2 - 2 \langle \phi(\mathbf{x}), \Phi_{rec} \rangle + \| \Phi_{rec} \|^2$$
The third term is constant with respect to $\mathbf{x}$. Let's analyze the first two:
Term 1: $\|\phi(\mathbf{x})\|^2 = k(\mathbf{x}, \mathbf{x})$
For the RBF kernel, $k(\mathbf{x}, \mathbf{x}) = 1$ for all $\mathbf{x}$, so this is constant.
Term 2: $\langle \phi(\mathbf{x}), \Phi_{rec} \rangle = \sum_{j=1}^k z_j \langle \phi(\mathbf{x}), \mathbf{v}_j \rangle$
Using the dual representation $\mathbf{v}_j = \sum_i \alpha_i^{(j)} \tilde{\phi}(\mathbf{x}_i)$: $$= \sum_{j=1}^k z_j \sum_i \alpha_i^{(j)} \langle \phi(\mathbf{x}), \tilde{\phi}(\mathbf{x}_i) \rangle$$
This requires computing centered kernel values between $\mathbf{x}$ and all training points.
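Concretely, this means recentering each new kernel row with statistics from the training kernel matrix. A minimal sketch, assuming an RBF kernel and hypothetical variables `X_train` (training data), `K_train` (uncentered training kernel matrix), and `gamma`:

```python
import numpy as np

def centered_kernel_row(x, X_train, K_train, gamma):
    """Centered kernel values k_tilde(x, x_i) between a new point x and the training points."""
    k_row = np.exp(-gamma * np.sum((X_train - x)**2, axis=1))  # k(x, x_i)
    col_means = K_train.mean(axis=0)   # (1/n) sum_m k(x_m, x_i)
    row_mean = k_row.mean()            # (1/n) sum_m k(x, x_m)
    grand_mean = K_train.mean()
    return k_row - col_means - row_mean + grand_mean

# The cross term <phi(x), Phi_rec> is then betas @ centered_kernel_row(x, ...),
# with the beta weights derived from z and the alpha coefficients as in the
# fixed-point section below.
```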
The Optimization Landscape
For nonlinear kernels like the RBF, this objective is typically non-convex with potentially many local minima. Since the remaining terms are constant for the RBF kernel (where $k(\mathbf{x}, \mathbf{x}) = 1$), minimizing the distance reduces to: $$\hat{\mathbf{x}} = \arg\max_{\mathbf{x}} \langle \phi(\mathbf{x}), \Phi_{rec} \rangle$$
which has no closed-form solution and requires iterative methods.
One of the most widely used approaches to the pre-image problem is the fixed-point iteration method proposed by Mika et al. and refined by Schölkopf et al.
The Weighted Combination Idea
The key insight is that we can express the pre-image as a weighted combination of training points, where the weights depend on kernel similarities:
$$\hat{\mathbf{x}} = \frac{\sum_{i=1}^n \gamma_i \mathbf{x}_i}{\sum_{i=1}^n \gamma_i}$$
where $\gamma_i$ measures how similar $\phi(\mathbf{x})$ should be to $\phi(\mathbf{x}_i)$ given the target projection.
Derivation for RBF Kernel
For the RBF kernel $k(\mathbf{x}, \mathbf{y}) = \exp(-|\mathbf{x} - \mathbf{y}|^2 / 2\sigma^2)$, we can derive a fixed-point equation.
The feature-space reconstruction can be written as: $$\Phi_{rec} = \sum_{j=1}^k z_j \mathbf{v}_j = \sum_{j=1}^k z_j \sum_i \alpha_i^{(j)} \tilde{\phi}(\mathbf{x}_i)$$
Define weights: $$\beta_i = \sum_{j=1}^k z_j \alpha_i^{(j)}$$
Then $\Phi_{rec} = \sum_i \beta_i \tilde{\phi}(\mathbf{x}_i)$ (approximately, ignoring centering for now).
For the RBF kernel, maximizing $\langle \phi(\mathbf{x}), \Phi_{rec} \rangle$ leads to the fixed-point iteration:
$$\mathbf{x}^{(t+1)} = \frac{\sum_{i=1}^n \gamma_i^{(t)} \mathbf{x}_i}{\sum_{i=1}^n \gamma_i^{(t)}}$$
where $\gamma_i^{(t)} = \beta_i \cdot k(\mathbf{x}^{(t)}, \mathbf{x}_i)$ and $\beta_i = \sum_j z_j \alpha_i^{(j)}$.
Algorithm: Fixed-Point Pre-Image Estimation
Input: Projection $\mathbf{z} = (z_1, \ldots, z_k)$, eigenvectors $\{\boldsymbol{\alpha}^{(j)}\}$, training data $\{\mathbf{x}_i\}$
Compute combination weights: $\beta_i = \sum_{j=1}^k z_j \alpha_i^{(j)}$
Initialize: $\mathbf{x}^{(0)} = $ centroid of training data (or random point)
Iterate until convergence: compute $\gamma_i^{(t)} = \beta_i\, k(\mathbf{x}^{(t)}, \mathbf{x}_i)$ for all $i$, then update $\mathbf{x}^{(t+1)} = \sum_i \gamma_i^{(t)} \mathbf{x}_i \big/ \sum_i \gamma_i^{(t)}$
Output: Approximate pre-image $\hat{\mathbf{x}} = \mathbf{x}^{(T)}$
Convergence Properties
The iteration often converges in a modest number of steps, but convergence is not guaranteed: the update can oscillate or become numerically unstable when the weights $\gamma_i^{(t)}$ are small or nearly cancel, and the result depends on the initialization. Multiple restarts, as used in the implementation below, are a common safeguard.
```python
import numpy as np

def rbf_kernel(x: np.ndarray, Y: np.ndarray, gamma: float) -> np.ndarray:
    """Compute RBF kernel between point x and matrix Y."""
    diff = Y - x.reshape(1, -1)
    return np.exp(-gamma * np.sum(diff**2, axis=1))

def fixed_point_preimage(
        z: np.ndarray,        # Target projection (k,)
        alphas: np.ndarray,   # KPCA eigenvectors (n, k)
        X_train: np.ndarray,  # Training data (n, d)
        gamma: float,         # RBF kernel parameter
        max_iter: int = 100,
        tol: float = 1e-6,
        n_restarts: int = 5) -> tuple:
    """
    Estimate pre-image using fixed-point iteration.

    Parameters:
        z: Target projection in KPCA space
        alphas: Normalized eigenvectors from KPCA (columns)
        X_train: Training data
        gamma: RBF kernel gamma parameter
        max_iter: Maximum iterations per restart
        tol: Convergence tolerance
        n_restarts: Number of random restarts

    Returns:
        x_preimage: Estimated pre-image point
        best_obj: Best objective value achieved
    """
    n, d = X_train.shape

    # Compute combination weights: beta_i = sum_j z_j * alpha_i^(j)
    betas = alphas @ z  # Shape: (n,)

    best_x = None
    best_obj = -np.inf

    for restart in range(n_restarts):
        # Initialize with centroid (first restart) or a random training point
        if restart == 0:
            x = X_train.mean(axis=0)  # Centroid
        else:
            idx = np.random.choice(n)
            x = X_train[idx].copy()

        for iteration in range(max_iter):
            x_old = x.copy()

            # Compute kernel weights
            k_vals = rbf_kernel(x, X_train, gamma)
            gamma_weights = betas * k_vals

            # Handle all-zero weights
            if gamma_weights.sum() == 0:
                break

            # Normalize weights (use absolute values to handle sign issues)
            abs_gamma = np.abs(gamma_weights)
            if abs_gamma.sum() > 0:
                x = (abs_gamma[:, None] * X_train).sum(axis=0) / abs_gamma.sum()

            # Check convergence
            if np.linalg.norm(x - x_old) < tol:
                break

        # Evaluate objective: correlation with target in feature space
        k_final = rbf_kernel(x, X_train, gamma)
        obj = np.sum(betas * k_final)

        if obj > best_obj:
            best_obj = obj
            best_x = x.copy()

    return best_x, best_obj

# Example usage
np.random.seed(42)
n = 100
d = 5
X_train = np.random.randn(n, d)

# Fit KPCA (simplified)
gamma = 0.5
K = np.exp(-gamma * np.sum((X_train[:, None, :] - X_train[None, :, :])**2, axis=2))

# Center kernel
row_means = K.mean(axis=1, keepdims=True)
col_means = K.mean(axis=0, keepdims=True)
grand_mean = K.mean()
K_centered = K - row_means - col_means + grand_mean

# Eigendecomposition
eigenvalues, eigenvectors = np.linalg.eigh(K_centered)
idx = np.argsort(eigenvalues)[::-1]
n_components = 3
eigenvalues = eigenvalues[idx][:n_components]
eigenvectors = eigenvectors[:, idx][:, :n_components]
alphas = eigenvectors / np.sqrt(np.maximum(eigenvalues, 1e-10))

# Project a training point
test_idx = 42
z = K_centered[test_idx] @ alphas

# Find pre-image
x_preimage, obj = fixed_point_preimage(z, alphas, X_train, gamma)

print(f"Original point: {X_train[test_idx][:3]}...")
print(f"Pre-image estimate: {x_preimage[:3]}...")
print(f"Reconstruction error: {np.linalg.norm(X_train[test_idx] - x_preimage):.4f}")
```

An alternative to fixed-point iteration is to directly minimize the feature-space reconstruction error using gradient descent.
Objective Function
Define the reconstruction error (for the RBF kernel, where $\|\phi(\mathbf{x})\|^2 = 1$):
$$L(\mathbf{x}) = \| \phi(\mathbf{x}) - \Phi_{rec} \|^2 = 1 - 2\langle \phi(\mathbf{x}), \Phi_{rec} \rangle + \| \Phi_{rec} \|^2$$
Since we want to minimize this, and the first and third terms are constant (for RBF), we maximize: $$J(\mathbf{x}) = \langle \phi(\mathbf{x}), \Phi_{rec} \rangle = \sum_i \beta_i \tilde{k}(\mathbf{x}, \mathbf{x}_i)$$
where $\tilde{k}$ denotes the centered kernel evaluation.
Computing the Gradient
For the RBF kernel: $$k(\mathbf{x}, \mathbf{x}_i) = \exp(-\gamma |\mathbf{x} - \mathbf{x}_i|^2)$$
The gradient with respect to $\mathbf{x}$ is: $$\frac{\partial k}{\partial \mathbf{x}} = -2\gamma (\mathbf{x} - \mathbf{x}_i) k(\mathbf{x}, \mathbf{x}_i)$$
For the full objective, ignoring the centering terms for now: $$\nabla_\mathbf{x} J = -2\gamma \sum_i \beta_i\, k(\mathbf{x}, \mathbf{x}_i)(\mathbf{x} - \mathbf{x}_i)$$
The additional terms introduced by centering are addressed below.
This can be used in gradient ascent to find local maxima of $J(\mathbf{x})$.
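Before using this gradient in an optimizer, it is worth validating it numerically. The sketch below compares the analytic expression (without centering, as above) against central finite differences; `betas`, `X_train`, and `gamma` here are randomly generated stand-ins for the quantities defined earlier:

```python
import numpy as np

def J(x, betas, X_train, gamma):
    """Uncentered objective: sum_i beta_i * k(x, x_i) for the RBF kernel."""
    k_vals = np.exp(-gamma * np.sum((X_train - x)**2, axis=1))
    return betas @ k_vals

def grad_J(x, betas, X_train, gamma):
    """Analytic gradient: -2*gamma * sum_i beta_i * k(x, x_i) * (x - x_i)."""
    diff = x - X_train
    k_vals = np.exp(-gamma * np.sum(diff**2, axis=1))
    return -2 * gamma * (betas * k_vals) @ diff

# Finite-difference comparison at a random point (illustrative sizes)
rng = np.random.default_rng(1)
X_train = rng.normal(size=(50, 4))
betas = rng.normal(size=50)
gamma, eps = 0.5, 1e-6
x0 = rng.normal(size=4)

num_grad = np.array([
    (J(x0 + eps * e, betas, X_train, gamma)
     - J(x0 - eps * e, betas, X_train, gamma)) / (2 * eps)
    for e in np.eye(4)
])
# Maximum deviation between numerical and analytic gradient; should be close to zero
print(np.max(np.abs(num_grad - grad_J(x0, betas, X_train, gamma))))
```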
Gradient Ascent Algorithm
Initialize: $\mathbf{x}^{(0)}$ (e.g., training centroid or weighted average)
Iterate: compute the gradient $\nabla_\mathbf{x} J(\mathbf{x}^{(t)})$ and update $\mathbf{x}^{(t+1)} = \mathbf{x}^{(t)} + \eta\, \nabla_\mathbf{x} J(\mathbf{x}^{(t)})$ with step size $\eta$
Stop when gradient magnitude is below threshold or max iterations reached
Advantages Over Fixed-Point Iteration
Gradient ascent applies to any differentiable kernel, not just the RBF kernel, and it can use standard optimization machinery such as line search, momentum, or adaptive step sizes.
Disadvantages
It requires tuning a step size, usually needs more iterations than the fixed-point update, and still converges only to a local optimum.
Handling Centering in the Gradient
The proper objective uses centered kernel values. Let: $$\tilde{k}(\mathbf{x}, \mathbf{x}_i) = k(\mathbf{x}, \mathbf{x}_i) - \frac{1}{n}\sum_m k(\mathbf{x}_m, \mathbf{x}_i) - \frac{1}{n}\sum_m k(\mathbf{x}, \mathbf{x}_m) + \bar{K}$$
The third term depends on $\mathbf{x}$, adding to the gradient: $$\frac{\partial}{\partial \mathbf{x}}\left[-\frac{1}{n}\sum_m k(\mathbf{x}, \mathbf{x}_m)\right] = \frac{2\gamma}{n}\sum_m k(\mathbf{x}, \mathbf{x}_m)(\mathbf{x} - \mathbf{x}_m)$$
This must be included for correct centering-aware optimization.
```python
import numpy as np

def gradient_preimage(
        z: np.ndarray,        # Target projection (k,)
        alphas: np.ndarray,   # KPCA eigenvectors (n, k)
        X_train: np.ndarray,  # Training data (n, d)
        K_train: np.ndarray,  # Training kernel matrix (n, n)
        gamma: float,         # RBF kernel parameter
        lr: float = 0.1,
        max_iter: int = 500,
        tol: float = 1e-6,
        n_restarts: int = 3) -> tuple:
    """
    Estimate pre-image using gradient ascent.

    Properly handles centering in the objective and gradient.
    """
    n, d = X_train.shape

    # Compute combination weights
    betas = alphas @ z  # Shape: (n,)

    # Pre-compute training statistics for centering
    K_col_means = K_train.mean(axis=0)  # (n,)
    K_grand_mean = K_train.mean()

    def rbf_and_grad(x):
        """Compute RBF kernel values and gradients."""
        diff = x - X_train                  # (n, d)
        sq_dist = np.sum(diff**2, axis=1)   # (n,)
        k_vals = np.exp(-gamma * sq_dist)   # (n,)
        # Gradient: dk/dx = -2*gamma*(x - x_i)*k
        dk = -2 * gamma * diff * k_vals[:, None]  # (n, d)
        return k_vals, dk

    def objective_and_gradient(x):
        """
        Compute centered objective and gradient.

        Centered kernel: k_tilde(x, x_i) = k(x, x_i) - row/column means + grand mean
        """
        k_vals, dk = rbf_and_grad(x)

        # Centering for test point
        test_row_mean = k_vals.mean()  # Mean of k(x, x_i) over i
        k_centered = k_vals - K_col_means - test_row_mean + K_grand_mean

        # Objective: sum_i beta_i * k_tilde(x, x_i)
        obj = np.dot(betas, k_centered)

        # Gradient of centered kernel:
        # d(k_centered_i)/dx = dk_i/dx - (1/n)*sum_j dk_j/dx
        dk_mean = dk.mean(axis=0)
        dk_centered = dk - dk_mean  # (n, d)

        # Gradient of objective
        grad = np.dot(betas, dk_centered)  # (d,)

        return obj, grad

    best_x = None
    best_obj = -np.inf

    for restart in range(n_restarts):
        # Initialize
        if restart == 0:
            x = X_train.mean(axis=0)
        else:
            x = X_train[np.random.choice(n)].copy()

        for iteration in range(max_iter):
            obj, grad = objective_and_gradient(x)

            # Gradient ascent step
            x_new = x + lr * grad

            # Check convergence
            if np.linalg.norm(x_new - x) < tol:
                break

            x = x_new

        final_obj, _ = objective_and_gradient(x)
        if final_obj > best_obj:
            best_obj = final_obj
            best_x = x.copy()

    return best_x, best_obj

# Example comparison with fixed-point method
np.random.seed(42)
n, d = 100, 5
X_train = np.random.randn(n, d)
gamma = 0.5

# Compute kernel and KPCA
K = np.exp(-gamma * np.sum((X_train[:, None, :] - X_train[None, :, :])**2, axis=2))
row_means = K.mean(axis=1, keepdims=True)
col_means = K.mean(axis=0, keepdims=True)
grand_mean = K.mean()
K_centered = K - row_means - col_means + grand_mean

eigenvalues, eigenvectors = np.linalg.eigh(K_centered)
idx = np.argsort(eigenvalues)[::-1]
n_components = 3
eigenvalues = eigenvalues[idx][:n_components]
eigenvectors = eigenvectors[:, idx][:, :n_components]
alphas = eigenvectors / np.sqrt(np.maximum(eigenvalues, 1e-10))

# Project and reconstruct
test_idx = 25
z = K_centered[test_idx] @ alphas

x_gradient, obj_grad = gradient_preimage(z, alphas, X_train, K, gamma)

print(f"Original: {X_train[test_idx][:3]}...")
print(f"Gradient reconstruction: {x_gradient[:3]}...")
print(f"Gradient error: {np.linalg.norm(X_train[test_idx] - x_gradient):.4f}")
```

Pre-image estimation is inherently imperfect. Understanding its limitations is crucial for practical applications.
When Pre-images Work Well
High variance explained: When the top $k$ components capture most of the variance, the reconstruction target $\Phi_{rec}$ is close to the original $\phi(\mathbf{x})$, making good pre-images possible.
Smooth kernels: RBF kernels with appropriate bandwidth produce smooth optimization landscapes with fewer local minima.
Dense training data: When training points densely cover the input domain, pre-images can be well-approximated as weighted combinations.
Points near training data: Pre-images for projections of points near the training manifold are more reliable than for extrapolated points.
When Pre-images Fail
Low variance explained: When many components are discarded, the reconstruction target may be far from any valid feature-space image, and pre-images become unreliable.
Extrapolation: For projections corresponding to points outside the training domain, pre-image estimation often fails badly.
Local minima: Non-convex optimization can converge to poor local solutions.
Ill-conditioning: Certain kernel parameters (very small or large $\gamma$) lead to poorly-conditioned optimization.
For "synthetic" projections—points in KPCA space that don't correspond to any actual input (e.g., for generative applications)—pre-image quality is fundamentally limited. The reconstruction target may not lie near the manifold of valid feature-space images, making any pre-image a poor approximation.
Evaluating Pre-image Quality
Several metrics can assess pre-image quality:
1. Reprojection Error
Project the estimated pre-image back to KPCA space and compare: $$\text{Error} = \| \mathbf{z} - \mathbf{z}_{\text{reconstructed}} \|$$
Small reprojection error is necessary but not sufficient for good pre-images.
2. Feature-Space Distance
Compute (approximately) the distance in feature space: $$\| \phi(\hat{\mathbf{x}}) - \Phi_{rec} \|^2 \approx 1 - 2\langle \phi(\hat{\mathbf{x}}), \Phi_{rec} \rangle + \| \Phi_{rec} \|^2$$
3. Input-Space Error (for known pre-images)
When the original input is known, measure: $$\| \hat{\mathbf{x}} - \mathbf{x}_{\text{original}} \|$$
This is the most meaningful metric but requires ground truth.
4. Consistency Check
For the same projection, run pre-image estimation with different initializations. Large variance indicates unreliable estimates.
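As a sketch of the first two checks, reusing the variables from the fixed-point example above (`X_train`, `K`, `alphas`, `gamma`, `z`, and the estimate `x_preimage`) and the same centering convention:

```python
import numpy as np

# 1. Reprojection error: map the estimated pre-image back into KPCA space.
k_row = np.exp(-gamma * np.sum((X_train - x_preimage)**2, axis=1))
k_cent = k_row - K.mean(axis=0) - k_row.mean() + K.mean()
z_reproj = k_cent @ alphas
print(f"Reprojection error: {np.linalg.norm(z - z_reproj):.4f}")

# 2. Approximate feature-space distance for the RBF kernel:
#    ||phi(x)||^2 = 1 and ||Phi_rec||^2 = sum_j z_j^2 (orthonormal components).
betas = alphas @ z
cross = betas @ k_cent                       # approximates <phi(x_hat), Phi_rec>
feat_dist_sq = 1.0 - 2.0 * cross + np.sum(z**2)
print(f"Feature-space distance^2 (approximate): {feat_dist_sq:.4f}")
```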
| Method | Strengths | Weaknesses | When to Use |
|---|---|---|---|
| Fixed-point iteration | Fast, no tuning needed | May not converge, limited to specific kernels | RBF kernel, fast approximation needed |
| Gradient ascent | General, flexible, can use advanced optimizers | Requires step size tuning, slower | Non-RBF kernels, need more accuracy |
| Multiple restarts | Avoids some local minima | Expensive, no guarantee of global optimum | Critical applications, time available |
| Constrained optimization | Enforces validity constraints | More complex, requires constraint definition | Bounded input domains |
Despite its limitations, pre-image estimation enables important applications of Kernel PCA.
Denoising
The most established application is denoising. The procedure: map the (possibly noisy) input to its KPCA projection, keep only the leading $k$ components, and estimate a pre-image of the truncated feature-space reconstruction.
The projection "filters out" high-frequency noise that doesn't align with the principal components, and the pre-image provides a cleaned approximation in input space.
Image Denoising Example: in the classic experiments of Mika et al. on USPS handwritten digits, pre-images of projections onto the leading kernel principal components removed additive noise more effectively than linear PCA reconstructions using the same number of components.
Compression and Reconstruction
KPCA can compress data by storing only the low-dimensional projections: each sample is reduced to its $k$-dimensional code $\mathbf{z}$, and an approximate input is recovered on demand via pre-image estimation (the training data and eigenvector coefficients are stored once as part of the model).
This is useful when storage is limited and approximate reconstruction is acceptable.
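A rough sketch of this round trip, reusing `X_train`, `K`, `alphas`, `gamma`, and `fixed_point_preimage` from the earlier example (the "new" points here are synthetic stand-ins):

```python
import numpy as np

# Compress new points: store only their k-dimensional KPCA codes.
# The KPCA model itself (X_train, alphas, gamma) is stored once, not per point.
X_new = X_train[:10] + 0.05 * np.random.randn(10, X_train.shape[1])
K_new = np.exp(-gamma * np.sum((X_new[:, None] - X_train[None, :])**2, axis=2))
K_new_cent = K_new - K.mean(axis=0) - K_new.mean(axis=1, keepdims=True) + K.mean()
Z_codes = K_new_cent @ alphas                          # (10, k) stored codes

# Decompress on demand via pre-image estimation.
x_rec, _ = fixed_point_preimage(Z_codes[0], alphas, X_train, gamma)
print(f"Stored {Z_codes.shape[1]} of {X_new.shape[1]} values per point; "
      f"reconstruction error: {np.linalg.norm(X_new[0] - x_rec):.4f}")
```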
For denoising, choosing the right number of components $k$ is crucial. Too few components lose signal; too many retain noise. Cross-validation on a held-out set with known noise levels can help select $k$.
Feature Extraction and Visualization
Understanding what input features drive KPCA components: sweep one component $z_j$ over a range of values while holding the others at zero, estimate a pre-image for each synthetic projection, and inspect how the reconstructed inputs change.
This reveals what input-space variations correspond to each kernel principal component.
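A minimal sketch of such a sweep, reusing `fixed_point_preimage`, `alphas`, `X_train`, and `gamma` from the earlier example (the sweep range is illustrative, and per the warning above, pre-images of synthetic projections should be read qualitatively):

```python
import numpy as np

# Sweep the first kernel principal component while holding the others at zero,
# and inspect how the estimated pre-images move through input space.
component = 0
for value in np.linspace(-1.0, 1.0, 5):
    z_synth = np.zeros(alphas.shape[1])
    z_synth[component] = value
    x_hat, _ = fixed_point_preimage(z_synth, alphas, X_train, gamma)
    print(f"z_{component + 1} = {value:+.2f} -> "
          f"pre-image (first 3 dims): {np.round(x_hat[:3], 3)}")
```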
Novelty Detection
Pre-image distance can indicate anomalies: project a new point into KPCA space, reconstruct it from the leading components, and flag it as novel when the distance between the point and its estimated pre-image (or the feature-space reconstruction error) is unusually large.
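A sketch of this idea, again reusing `fixed_point_preimage`, `X_train`, `K`, `alphas`, and `gamma` from the earlier example (the shifted test point, and any threshold you might apply, are illustrative):

```python
import numpy as np

def novelty_score(x_new, X_train, K, alphas, gamma):
    """Input-space distance between a point and the pre-image of its KPCA projection."""
    k_row = np.exp(-gamma * np.sum((X_train - x_new)**2, axis=1))
    k_cent = k_row - K.mean(axis=0) - k_row.mean() + K.mean()
    z_new = k_cent @ alphas                                         # project
    x_hat, _ = fixed_point_preimage(z_new, alphas, X_train, gamma)  # reconstruct
    return np.linalg.norm(x_new - x_hat)

# A point far from the training distribution should score noticeably higher.
in_dist = novelty_score(X_train[0], X_train, K, alphas, gamma)
shifted = novelty_score(X_train[0] + 5.0, X_train, K, alphas, gamma)
print(f"In-distribution score: {in_dist:.3f}, shifted-point score: {shifted:.3f}")
```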
Limitations in Generative Applications
Attempts to use KPCA generatively (sampling in the latent space and finding pre-images) are problematic: sampled projections rarely lie near the image of $\phi$, so their pre-images are unreliable; KPCA defines no probability density over the latent space to sample from; and quality degrades quickly for projections far from the training data.
KPCA is fundamentally a discriminative/descriptive tool, not a generative model.
```python
import numpy as np

def kpca_denoise(
        X_noisy: np.ndarray,
        X_train: np.ndarray,
        K_train: np.ndarray,
        K_test: np.ndarray,       # Kernel between noisy points and training
        alphas: np.ndarray,       # Normalized KPCA eigenvectors
        gamma: float,
        centering_stats: tuple    # (col_means, grand_mean) from training
) -> np.ndarray:
    """
    Denoise data by projecting to KPCA space and estimating pre-images.

    Parameters:
        X_noisy: Noisy input data to denoise (m, d)
        X_train: Training data for KPCA (n, d)
        K_train: Training kernel matrix (n, n)
        K_test: Test-train kernel matrix (m, n)
        alphas: KPCA eigenvectors (n, k)
        gamma: RBF kernel gamma
        centering_stats: Tuple of (column_means, grand_mean) from training

    Returns:
        X_denoised: Denoised data (m, d)
    """
    col_means, grand_mean = centering_stats
    m, d = X_noisy.shape
    n = X_train.shape[0]

    # Center test kernel
    test_row_means = K_test.mean(axis=1, keepdims=True)
    K_test_centered = K_test - col_means.reshape(1, -1) - test_row_means + grand_mean

    # Project noisy data to KPCA space
    Z = K_test_centered @ alphas  # (m, k)

    X_denoised = np.zeros_like(X_noisy)

    for i in range(m):
        z = Z[i]
        betas = alphas @ z  # Combination weights

        # Simple weighted combination as initial estimate
        # (fast approximation instead of full optimization)
        k_vals = np.exp(-gamma * np.sum((X_noisy[i:i+1] - X_train)**2, axis=1))
        weights = np.abs(betas) * k_vals

        if weights.sum() > 0:
            X_denoised[i] = (weights[:, None] * X_train).sum(axis=0) / weights.sum()
        else:
            X_denoised[i] = X_noisy[i]  # Fallback

    return X_denoised

# Example: denoising 1D function values
np.random.seed(42)

# Training data: clean sinusoid samples
n_train = 100
t_train = np.linspace(0, 4*np.pi, n_train)
X_train = np.column_stack([t_train, np.sin(t_train)])

# Noisy test data
n_test = 50
t_test = np.linspace(0, 4*np.pi, n_test)
noise = np.random.randn(n_test) * 0.3
X_noisy = np.column_stack([t_test, np.sin(t_test) + noise])
X_clean = np.column_stack([t_test, np.sin(t_test)])

# Compute kernels
gamma = 1.0
K_train = np.exp(-gamma * np.sum((X_train[:, None] - X_train[None, :])**2, axis=2))
K_test = np.exp(-gamma * np.sum((X_noisy[:, None] - X_train[None, :])**2, axis=2))

# Center training kernel and fit KPCA
col_means = K_train.mean(axis=0)
grand_mean = K_train.mean()
K_train_cent = K_train - col_means - col_means[:, None] + grand_mean

eigenvalues, eigenvectors = np.linalg.eigh(K_train_cent)
idx = np.argsort(eigenvalues)[::-1]
k = 5  # Keep 5 components
eigenvalues = eigenvalues[idx][:k]
eigenvectors = eigenvectors[:, idx][:, :k]
alphas = eigenvectors / np.sqrt(np.maximum(eigenvalues, 1e-10))

# Denoise
X_denoised = kpca_denoise(X_noisy, X_train, K_train, K_test, alphas, gamma,
                          (col_means, grand_mean))

# Evaluate
noise_error = np.mean((X_noisy[:, 1] - X_clean[:, 1])**2)
denoised_error = np.mean((X_denoised[:, 1] - X_clean[:, 1])**2)
print(f"Original MSE (noisy vs clean): {noise_error:.4f}")
print(f"Denoised MSE: {denoised_error:.4f}")
print(f"Noise reduction: {(1 - denoised_error/noise_error)*100:.1f}%")
```

The pre-image problem is a fundamental challenge in Kernel PCA that arises when we need to map from the reduced representation back to input space. Unlike linear PCA, there is no closed-form solution.
You now understand the pre-image problem: why it is fundamentally difficult, how iterative methods provide approximate solutions, and which applications benefit from pre-image estimation.
What's Next:
The final page addresses kernel selection—arguably the most consequential practical decision in Kernel PCA. We'll explore the properties of common kernels, methods for selecting kernel parameters, and strategies for evaluating whether a particular kernel is appropriate for your data.