While Random Fourier Features approximate kernels by mapping data to an explicit feature space, an alternative strategy attacks the problem from a different angle: approximate the kernel matrix itself.
The Nyström approximation is a classical technique from numerical analysis that constructs a low-rank approximation to the $n \times n$ kernel matrix using only a small subset of $m \ll n$ columns. The insight is elegant: if the kernel matrix has rapidly decaying eigenvalues (as most practical kernel matrices do), then most of its information is captured by a low-rank structure.
Named after the Finnish mathematician Evert Johannes Nyström, who developed related techniques for integral equations around 1930, this method has become a cornerstone of scalable kernel learning.
By the end of this page, you will understand how the Nyström method constructs low-rank approximations, the mathematical guarantees on approximation quality, various landmark (inducing point) selection strategies, computational costs and trade-offs, and when to prefer Nyström over Random Fourier Features.
Before diving into the Nyström method, we must understand why low-rank approximations work well for kernel matrices.
Eigenvalue decay:
Kernel matrices arising from smooth kernels (like RBF) on bounded domains exhibit a characteristic property: their eigenvalues decay rapidly. If we compute the eigendecomposition:
$$\mathbf{K} = \sum_{i=1}^{n} \lambda_i \mathbf{u}_i \mathbf{u}_i^T$$
the eigenvalues $\lambda_1 \geq \lambda_2 \geq \ldots \geq \lambda_n \geq 0$ typically follow a power-law or exponential decay:
$$\lambda_i \approx C \cdot i^{-\alpha} \quad \text{or} \quad \lambda_i \approx C \cdot e^{-\beta i}$$
for constants depending on the kernel and data distribution.
The effective rank of a kernel matrix is the number of eigenvalues that contribute meaningfully to the matrix. For RBF kernels on typical datasets, the effective rank is often $O(\sqrt{n})$ or even $O(\log n)$—much smaller than $n$. This is why low-rank approximations can be extremely accurate with $m \ll n$ components.
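A quick empirical check of this claim, as a minimal sketch: build a small RBF kernel matrix on synthetic data (the data, bandwidth `gamma=0.5`, and the 99% threshold are illustrative choices) and count how many eigenvalues are needed to cover most of the trace.

```python
import numpy as np

def rbf(A, B, gamma=0.5):
    """RBF kernel matrix between rows of A and rows of B."""
    d = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1) - 2 * A @ B.T
    return np.exp(-gamma * d)

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 5))      # 500 synthetic points in 5 dimensions
K = rbf(X, X)

# Eigenvalues in decreasing order
eigvals = np.linalg.eigvalsh(K)[::-1]

# Effective rank: how many eigenvalues are needed to cover 99% of the trace
cumulative = np.cumsum(eigvals) / np.sum(eigvals)
m_99 = int(np.searchsorted(cumulative, 0.99)) + 1
print(f"n = {K.shape[0]}, eigenvalues covering 99% of the trace: {m_99}")
print("Top 5 eigenvalues:", np.round(eigvals[:5], 2))
```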
Physical intuition:
Why do kernel matrices have low effective rank? Consider what the kernel measures: similarity between data points. If the data lies on a low-dimensional manifold or has clustered structure, then many points look alike through the lens of the kernel, so their rows in $\mathbf{K}$ are nearly linearly dependent and the matrix carries far fewer than $n$ independent directions of information.
Optimal low-rank approximation:
The best rank-$m$ approximation to any matrix (minimizing Frobenius norm) is given by truncated SVD:
$$\mathbf{K}_m = \sum_{i=1}^{m} \lambda_i \mathbf{u}_i \mathbf{u}_i^T$$
The approximation error is:
$$\|\mathbf{K} - \mathbf{K}_m\|_F^2 = \sum_{i=m+1}^{n} \lambda_i^2$$
With rapid eigenvalue decay, this error is small even for moderate $m$. However, computing the full SVD costs $O(n^3)$—the very cost we're trying to avoid!
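The tail-sum formula is easy to verify numerically. The following sketch (same synthetic setup as above, with an illustrative truncation rank $m = 25$) compares the Frobenius error of the truncated eigendecomposition against $\sqrt{\sum_{i>m} \lambda_i^2}$; the two agree up to roundoff.

```python
import numpy as np

def rbf(A, B, gamma=0.5):
    d = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1) - 2 * A @ B.T
    return np.exp(-gamma * d)

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 5))
K = rbf(X, X)

# Eigendecomposition with eigenvalues in decreasing order
w, V = np.linalg.eigh(K)
w, V = w[::-1], V[:, ::-1]

m = 25
K_m = V[:, :m] @ np.diag(w[:m]) @ V[:, :m].T       # best rank-m approximation

frob_error = np.linalg.norm(K - K_m, "fro")
tail_error = np.sqrt(np.sum(w[m:] ** 2))           # sqrt of sum of squared tail eigenvalues
print(f"||K - K_m||_F = {frob_error:.6f}")
print(f"tail formula  = {tail_error:.6f}")
```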
| Eigenvalue Index | Relative Magnitude | Cumulative % of Trace |
|---|---|---|
| 1-10 | Large (dominant) | ~50-70% |
| 11-100 | Moderate | ~85-95% |
| 101-1000 | Small | ~99%+ |
| 1001+ | Negligible | Remainder < 1% |
The Nyström method constructs a low-rank approximation to $\mathbf{K}$ using only a subset of $m$ columns, without computing the full eigendecomposition.
The setup:
Select $m$ landmark points (also called inducing points or Nyström samples) from the training set. Without loss of generality, assume these are the first $m$ points. Partition the kernel matrix as:
$$\mathbf{K} = \begin{bmatrix} \mathbf{K}_{mm} & \mathbf{K}_{mn} \\ \mathbf{K}_{nm} & \mathbf{K}_{nn} \end{bmatrix}$$
where:
- $\mathbf{K}_{mm} \in \mathbb{R}^{m \times m}$ contains kernel values among the landmarks,
- $\mathbf{K}_{nm} = \mathbf{K}_{mn}^T \in \mathbb{R}^{(n-m) \times m}$ contains kernel values between the remaining points and the landmarks,
- $\mathbf{K}_{nn} \in \mathbb{R}^{(n-m) \times (n-m)}$ contains kernel values among the remaining points—the large block we want to avoid computing.
The approximation:
The Nyström approximation estimates the full kernel matrix as:
$$\tilde{\mathbf{K}} = \mathbf{K}_{:m} \mathbf{K}_{mm}^{-1} \mathbf{K}_{m:}$$
where $\mathbf{K}_{:m} \in \mathbb{R}^{n \times m}$ contains the first $m$ columns of $\mathbf{K}$, and $\mathbf{K}_{m:} = \mathbf{K}_{:m}^T$.
The Nyström approximation only requires computing $\mathbf{K}_{:m}$ (all $n$ points against $m$ landmarks) and $\mathbf{K}_{mm}$ (landmarks against landmarks). Total kernel evaluations: $nm + m^2/2 \approx nm$ instead of $n^2/2$. For $m \ll n$, this is a massive saving!
Why does this work?
If the kernel matrix were exactly rank-$m$ with the top $m$ eigenvectors aligned with the landmark columns, the Nyström approximation would be exact. For kernel matrices with rapid eigenvalue decay, the landmarks (if chosen well) capture most of the column space of $\mathbf{K}$.
Mathematically, the Nyström approximation can be viewed as projecting each point's (implicit) feature map onto the subspace spanned by the landmarks' feature maps and taking inner products in that subspace; if the landmarks span the relevant part of feature space, little is lost.
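The following sketch makes this concrete: it forms $\tilde{\mathbf{K}} = \mathbf{K}_{:m}\mathbf{K}_{mm}^{-1}\mathbf{K}_{m:}$ from randomly chosen landmark columns and compares its error to the optimal rank-$m$ truncation. The data, bandwidth, and $m$ are illustrative choices, and a pseudo-inverse stands in for $\mathbf{K}_{mm}^{-1}$ for numerical safety.

```python
import numpy as np

def rbf(A, B, gamma=0.1):
    d = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1) - 2 * A @ B.T
    return np.exp(-gamma * d)

rng = np.random.default_rng(1)
X = rng.standard_normal((800, 10))
K = rbf(X, X)

m = 50
idx = rng.choice(len(X), size=m, replace=False)       # random landmarks
K_nm = K[:, idx]                                      # n x m landmark columns
K_mm = K[np.ix_(idx, idx)]                            # m x m landmark block

# Nyström approximation (pseudo-inverse for numerical stability)
K_tilde = K_nm @ np.linalg.pinv(K_mm) @ K_nm.T

# Optimal rank-m approximation via eigendecomposition, for reference
w, V = np.linalg.eigh(K)
w, V = w[::-1], V[:, ::-1]
K_best = V[:, :m] @ np.diag(w[:m]) @ V[:, :m].T

print("Nyström rank-m error :", np.linalg.norm(K - K_tilde, "fro"))
print("Optimal rank-m error :", np.linalg.norm(K - K_best, "fro"))
```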
The factored form:
For efficient computation, we never form the $n \times n$ matrix $\tilde{\mathbf{K}}$ explicitly. Instead, we use:
$$\tilde{\mathbf{K}} = \mathbf{K}_{:m} \mathbf{K}_{mm}^{-1} \mathbf{K}_{m:} = \boldsymbol{\Phi} \boldsymbol{\Phi}^T$$
where $\boldsymbol{\Phi} = \mathbf{K}_{:m} \mathbf{K}_{mm}^{-1/2} \in \mathbb{R}^{n \times m}$ is an explicit feature matrix.
This shows that Nyström, like RFF, provides an explicit $m$-dimensional feature representation!
| Matrix | Size | Cost to Compute | Storage |
|---|---|---|---|
| $\mathbf{K}_{:m}$ (landmark columns) | $n \times m$ | $O(nm \cdot d)$ | $O(nm)$ |
| $\mathbf{K}_{mm}$ (landmark kernel) | $m \times m$ | $O(m^2 \cdot d)$ | $O(m^2)$ |
| $\mathbf{K}_{mm}^{-1/2}$ (Cholesky) | $m \times m$ | $O(m^3)$ | $O(m^2)$ |
| $\boldsymbol{\Phi}$ (features) | $n \times m$ | $O(nm^2)$ | $O(nm)$ |
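In practice you rarely need to assemble $\boldsymbol{\Phi}$ by hand: scikit-learn ships this exact feature map as the `Nystroem` transformer. A minimal sketch, with illustrative data and parameter values:

```python
import numpy as np
from sklearn.kernel_approximation import Nystroem
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.standard_normal((5000, 20))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(5000)

# m = 300 landmarks -> explicit 300-dimensional Nyström features, then linear ridge
model = make_pipeline(
    Nystroem(kernel="rbf", gamma=0.1, n_components=300, random_state=0),
    Ridge(alpha=1.0),
)
model.fit(X, y)
print("Training R^2:", model.score(X, y))
```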
Let's analyze the complete computational cost of using Nyström approximation for kernel ridge regression.
Step 1: Form the Nyström decomposition
- Compute $\mathbf{K}_{:m}$ (all points vs. landmarks): $O(nmd)$
- Compute and factor $\mathbf{K}_{mm}$ (Cholesky): $O(m^2 d + m^3)$
- Form the features $\boldsymbol{\Phi} = \mathbf{K}_{:m}\mathbf{K}_{mm}^{-1/2}$: $O(nm^2)$
Total for decomposition: $O(nm(d + m) + m^3)$
Step 2: Solve ridge regression
With explicit features $\boldsymbol{\Phi}$, ridge regression becomes:
$$(\boldsymbol{\Phi}^T \boldsymbol{\Phi} + \lambda \mathbf{I})\mathbf{w} = \boldsymbol{\Phi}^T \mathbf{y}$$
Forming $\boldsymbol{\Phi}^T \boldsymbol{\Phi}$ costs $O(nm^2)$ and solving the resulting $m \times m$ system costs $O(m^3)$. Total for regression: $O(nm^2 + m^3)$
| Operation | Exact Kernel Method | Nyström ($m$ landmarks) |
|---|---|---|
| Kernel computation | $O(n^2 d)$ | $O(nmd)$ |
| Training | $O(n^3)$ | $O(nm^2 + m^3)$ |
| Memory | $O(n^2)$ | $O(nm)$ |
| Prediction (per point) | $O(nd)$ | $O(md + m)$ |
| Total Training | $O(n^2 d + n^3)$ | $O(nmd + nm^2 + m^3)$ |
For $m = \sqrt{n}$: training becomes $O(n^2)$ instead of $O(n^3)$. For $m = O(1)$ constant: training becomes $O(n)$ linear! The practical sweet spot is usually $m \approx 1,000-10,000$, giving excellent approximation quality with tractable computation for millions of points.
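As a rough back-of-the-envelope illustration (constants ignored): with $n = 10^6$ points, $d = 100$ features, and $m = 10^3$ landmarks, the exact approach needs on the order of $n^2 d / 2 \approx 5 \times 10^{13}$ kernel evaluations, an $O(n^3) = 10^{18}$-operation solve, and an $n \times n$ matrix occupying roughly 8 TB in double precision. Nyström needs about $nmd = 10^{11}$ kernel evaluations, an $O(nm^2) = 10^{12}$-operation solve, and roughly $nm \cdot 8$ bytes $\approx 8$ GB of feature storage.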
Comparison with Random Fourier Features:
Both methods produce $m$-dimensional explicit features. The key differences:
| Aspect | Nyström | Random Fourier Features |
|---|---|---|
| Basis functions | Data-dependent (landmarks) | Data-independent (random frequencies) |
| Feature computation | $O(md)$ per point | $O(md)$ per point |
| Preprocessing | $O(nm^2 + m^3)$ (factor $\mathbf{K}_{mm}$, form features) | $O(md)$ (sample frequencies) |
| Kernel dependence | Any kernel | Shift-invariant kernels only |
| Approximation adapts to | Data distribution | Kernel smoothness |
| Out-of-sample extension | Requires landmark storage | Only frequencies stored |
Key insight: Nyström adapts to the data (landmarks are data points), potentially capturing structure that random features miss. RFF is more general (works for any dataset with the same kernel) but may need more features for structured data.
The quality of the Nyström approximation depends critically on how landmarks are selected. Poor landmark choices lead to poor approximations regardless of $m$. Several strategies have been developed:
1. Uniform Random Sampling
The simplest approach: randomly sample $m$ points without replacement.
2. K-means Clustering Centers
Run k-means with $k = m$ clusters, use cluster centers as landmarks.
3. Leverage Score Sampling
Sample points with probability proportional to their statistical leverage, i.e., how much they contribute to the column space of $\mathbf{K}$. A standard choice is the ridge leverage score:
$$p_i \propto \ell_i(\lambda) = \left[\mathbf{K}(\mathbf{K} + \lambda \mathbf{I})^{-1}\right]_{ii}$$
Exact scores cost $O(n^3)$, but approximate leverage scores can be computed efficiently; a small exact-score sketch appears after the strategy list below.
| Method | Selection Cost | Approximation Quality | Ease of Implementation |
|---|---|---|---|
| Uniform Random | $O(n)$ | Moderate | Trivial |
| K-means Centers | $O(nmd \cdot T)$ | Good for clustered data | Easy (use library) |
| Greedy (MaxVol) | $O(nm^2)$ | Near-optimal | Complex |
| Leverage Scores | $O(nm^2)$ approx | Theoretical guarantees | Moderate |
| Determinantal Point Process | $O(n^3)$ exact, $O(nm^2)$ approx | Optimal diversity | Complex |
4. Greedy Selection (MaxVol)
Iteratively select the point that maximizes the volume of the selected set:
1. Initialize with random point or centroid
2. For i = 2 to m:
- Find the point whose addition maximizes det(K_{selected ∪ {new point}})
- Add to selected set
This greedily builds a set with maximum "spread" in the feature space.
5. Determinantal Point Processes (DPP)
Sample landmarks from a DPP, which naturally encourages diversity:
$$P(S) \propto \det(\mathbf{K}_S)$$
Points that are far apart (low kernel values) are more likely to be jointly selected.
For most applications, start with uniform random sampling—it's simple and often sufficient. If results are unsatisfactory, try k-means centers. Reserve leverage score or DPP sampling for cases where you need provable approximation guarantees or have very structured data.
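As a concrete illustration of leverage-score sampling (strategy 3 above), here is a minimal sketch using exact ridge leverage scores $\ell_i(\lambda) = [\mathbf{K}(\mathbf{K} + \lambda\mathbf{I})^{-1}]_{ii}$. It forms the full kernel matrix, so it is only feasible for small $n$; the bandwidth and regularization values are illustrative.

```python
import numpy as np

def ridge_leverage_landmarks(K, m, lam=1e-2, random_state=None):
    """Sample m landmark indices with probability proportional to the
    ridge leverage scores l_i = [K (K + lam*I)^{-1}]_{ii}."""
    rng = np.random.RandomState(random_state)
    n = K.shape[0]
    # Exact ridge leverage scores: O(n^3), use approximations for large n
    scores = np.diag(K @ np.linalg.inv(K + lam * np.eye(n)))
    p = scores / scores.sum()
    return rng.choice(n, size=m, replace=False, p=p)

# Example with a small RBF kernel matrix
rng = np.random.default_rng(0)
X = rng.standard_normal((400, 8))
sq = np.sum(X**2, axis=1)[:, None] + np.sum(X**2, axis=1) - 2 * X @ X.T
K = np.exp(-0.25 * sq)
landmarks = ridge_leverage_landmarks(K, m=40, random_state=0)
print(landmarks[:10])
```

The selection code below implements the random, k-means, and greedy strategies.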
```python
import numpy as np
from sklearn.cluster import KMeans


def select_landmarks_random(X, m, random_state=None):
    """
    Uniform random landmark selection.

    Parameters
    ----------
    X : ndarray of shape (n, d)
    m : int, number of landmarks
    random_state : int

    Returns
    -------
    indices : ndarray of shape (m,)
        Indices of selected landmarks
    """
    rng = np.random.RandomState(random_state)
    n = X.shape[0]
    return rng.choice(n, size=m, replace=False)


def select_landmarks_kmeans(X, m, random_state=None):
    """
    K-means based landmark selection.
    Returns cluster centers (not original points).
    """
    kmeans = KMeans(n_clusters=m, random_state=random_state, n_init=10)
    kmeans.fit(X)
    return kmeans.cluster_centers_  # Shape (m, d)


def select_landmarks_greedy(X, m, kernel_func, random_state=None):
    """
    Greedy landmark selection maximizing the kernel matrix determinant.

    Parameters
    ----------
    X : ndarray of shape (n, d)
    m : int, number of landmarks
    kernel_func : callable, kernel function k(X1, X2)

    Returns
    -------
    indices : ndarray of shape (m,)
        Indices of selected landmarks
    """
    rng = np.random.RandomState(random_state)
    n = X.shape[0]

    # Start with a random point
    selected = [rng.randint(n)]
    remaining = set(range(n)) - set(selected)

    for _ in range(m - 1):
        best_score = -np.inf
        best_idx = None

        # Current kernel matrix among selected points
        X_selected = X[selected]
        K_selected = kernel_func(X_selected, X_selected)

        # Try adding each remaining point
        for idx in remaining:
            # Kernel between the candidate and the selected set
            k_new = kernel_func(X[[idx]], X_selected).flatten()  # (len(selected),)
            k_self = kernel_func(X[[idx]], X[[idx]])[0, 0]

            # Score via the Schur complement:
            # det([K, k; k^T, k_self]) = det(K) * (k_self - k^T K^{-1} k),
            # so the candidate with the largest Schur complement maximizes
            # the determinant of the augmented matrix.
            try:
                v = np.linalg.solve(K_selected + 1e-6 * np.eye(len(selected)), k_new)
                schur = k_self - np.dot(k_new, v)
                if schur > 0:
                    score = np.log(schur)
                    if score > best_score:
                        best_score = score
                        best_idx = idx
            except np.linalg.LinAlgError:
                continue

        if best_idx is not None:
            selected.append(best_idx)
            remaining.remove(best_idx)
        else:
            # Fall back to a random choice if the greedy step fails
            best_idx = rng.choice(list(remaining))
            selected.append(best_idx)
            remaining.remove(best_idx)

    return np.array(selected)
```

Understanding when and how well the Nyström approximation works requires examining its error bounds.
Matrix approximation error:
The Frobenius norm error of the Nyström approximation is:
$$\|\mathbf{K} - \tilde{\mathbf{K}}\|_F = \|\mathbf{K} - \mathbf{K}_{:m}\mathbf{K}_{mm}^{-1}\mathbf{K}_{m:}\|_F$$
For uniform random sampling with $m$ landmarks, with probability at least $1 - \delta$:
$$\|\mathbf{K} - \tilde{\mathbf{K}}\|_F \leq \|\mathbf{K} - \mathbf{K}_k\|_F + O\left(\frac{n}{\sqrt{m}} \cdot \sqrt{\log(1/\delta)}\right)$$
where $\mathbf{K}_k$ is the best rank-$k$ approximation. The first term is unavoidable (SVD error), and the second vanishes as $m \to \infty$.
The approximation error depends heavily on eigenvalue decay. If $\lambda_i \sim i^{-\alpha}$ (polynomial decay), then the error scales as $O(m^{-(\alpha-1)/2})$. If $\lambda_i \sim e^{-\beta i}$ (exponential decay), error decreases exponentially in $m$. Smoother kernels = faster decay = better Nyström performance.
Leverage score sampling guarantees:
With leverage score sampling, stronger guarantees hold:
$$\|\mathbf{K} - \tilde{\mathbf{K}}\|_F \leq (1 + \epsilon) \|\mathbf{K} - \mathbf{K}_k\|_F$$
with $m = O(k/\epsilon)$ landmarks. This is near-optimal—matching SVD quality with only a $(1+\epsilon)$ factor!
Prediction error analysis:
More relevant than matrix approximation is how error affects learning. For kernel ridge regression:
$$\|\hat{f}_{\text{Nyström}} - \hat{f}_{\text{exact}}\|_{L^2} \leq \frac{\|\mathbf{K} - \tilde{\mathbf{K}}\|}{\lambda}$$
The regularization parameter $\lambda$ appears in the denominator—stronger regularization reduces sensitivity to approximation error.
| Setting | Error Bound | Rate |
|---|---|---|
| Uniform sampling | $O(n/\sqrt{m})$ | Slow |
| Leverage score sampling | $(1+\epsilon) \cdot \text{optimal}$ | Near-optimal |
| K-means landmarks | Data-dependent, no worst-case guarantee | Often good in practice |
| Greedy selection | Typically $O(1/m)$ in practice | Fast but no formal bound |
Practical quality assessment:
Rather than relying on theoretical bounds, practitioners often assess Nyström quality empirically: measure $\|\mathbf{K} - \tilde{\mathbf{K}}\|_F$ on a random subsample of points, or track validation error as $m$ increases (see the sketch below).
Typically, the approximation saturates at some $m^*$—additional landmarks provide diminishing returns.
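A minimal sketch of this check (illustrative data, bandwidth, and subsample size): estimate the relative Frobenius error of the Nyström approximation on a random subsample of points for increasing $m$, and stop when the curve flattens.

```python
import numpy as np

def rbf(A, B, gamma=0.1):
    d = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1) - 2 * A @ B.T
    return np.exp(-gamma * d)

rng = np.random.default_rng(0)
X = rng.standard_normal((3000, 15))

# Measure error on a random subsample (the full K would be 3000 x 3000)
sub = rng.choice(len(X), size=400, replace=False)
K_sub = rbf(X[sub], X[sub])

for m in [25, 50, 100, 200]:
    idx = rng.choice(len(X), size=m, replace=False)   # landmarks from the full set
    C = rbf(X[sub], X[idx])                           # subsample-to-landmark block
    W = rbf(X[idx], X[idx])                           # landmark-to-landmark block
    K_tilde = C @ np.linalg.pinv(W) @ C.T             # Nyström estimate of K_sub
    rel_err = np.linalg.norm(K_sub - K_tilde, "fro") / np.linalg.norm(K_sub, "fro")
    print(f"m = {m:4d}: relative Frobenius error = {rel_err:.4f}")
```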
Nyström can fail badly if landmarks miss important regions. Example: data has two well-separated clusters, all landmarks come from one cluster. The approximation will be poor for between-cluster kernel entries. K-means or stratified sampling mitigates this risk.
Let's implement a complete Nyström approximation for kernel ridge regression.
The algorithm:
1. Select $m$ landmarks (random or k-means).
2. Compute $\mathbf{K}_{nm}$ (all points vs. landmarks) and $\mathbf{K}_{mm}$ (landmarks vs. landmarks).
3. Cholesky-factor $\mathbf{K}_{mm} = \mathbf{L}\mathbf{L}^T$ (with jitter for numerical stability).
4. Form the features $\boldsymbol{\Phi} = \mathbf{K}_{nm}\mathbf{L}^{-T}$.
5. Solve the ridge system $(\boldsymbol{\Phi}^T\boldsymbol{\Phi} + \alpha\mathbf{I})\mathbf{w} = \boldsymbol{\Phi}^T\mathbf{y}$.
6. Predict with $\hat{y} = \boldsymbol{\phi}(\mathbf{x})^T\mathbf{w}$, where $\boldsymbol{\phi}(\mathbf{x}) = \mathbf{L}^{-1}\mathbf{k}_m(\mathbf{x})$ and $\mathbf{k}_m(\mathbf{x})$ holds the kernel values between $\mathbf{x}$ and the landmarks.
```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular, solve


def rbf_kernel(X1, X2, gamma=1.0):
    """Compute RBF kernel matrix between X1 and X2."""
    sq_dist = (
        np.sum(X1**2, axis=1, keepdims=True)
        + np.sum(X2**2, axis=1)
        - 2 * X1 @ X2.T
    )
    return np.exp(-gamma * sq_dist)


class NystromRidgeRegression:
    """
    Kernel Ridge Regression with Nyström approximation.

    Parameters
    ----------
    n_landmarks : int
        Number of landmark points.
    gamma : float
        RBF kernel bandwidth.
    alpha : float
        Ridge regularization strength.
    landmark_method : str
        'random' or 'kmeans'
    random_state : int
        Random seed.
    """

    def __init__(self, n_landmarks=100, gamma=1.0, alpha=1.0,
                 landmark_method='random', random_state=None):
        self.n_landmarks = n_landmarks
        self.gamma = gamma
        self.alpha = alpha
        self.landmark_method = landmark_method
        self.random_state = random_state

        # Fitted attributes
        self.landmarks_ = None
        self.L_mm_ = None   # Cholesky factor of K_mm
        self.coef_ = None   # Regression coefficients

    def _select_landmarks(self, X):
        """Select landmark points."""
        rng = np.random.RandomState(self.random_state)
        n = X.shape[0]
        m = min(self.n_landmarks, n)

        if self.landmark_method == 'random':
            indices = rng.choice(n, size=m, replace=False)
            return X[indices].copy()
        elif self.landmark_method == 'kmeans':
            from sklearn.cluster import KMeans
            km = KMeans(n_clusters=m, random_state=self.random_state, n_init=10)
            km.fit(X)
            return km.cluster_centers_
        else:
            raise ValueError(f"Unknown landmark method: {self.landmark_method}")

    def fit(self, X, y):
        """
        Fit Nyström ridge regression.

        Parameters
        ----------
        X : ndarray of shape (n_samples, n_features)
        y : ndarray of shape (n_samples,)
        """
        # Step 1: Select landmarks
        self.landmarks_ = self._select_landmarks(X)
        m = self.landmarks_.shape[0]

        # Step 2: Compute K_nm (all points to landmarks)
        K_nm = rbf_kernel(X, self.landmarks_, gamma=self.gamma)                # (n, m)

        # Step 3: Compute K_mm (landmarks to landmarks)
        K_mm = rbf_kernel(self.landmarks_, self.landmarks_, gamma=self.gamma)  # (m, m)

        # Step 4: Cholesky decomposition (with jitter for stability)
        jitter = 1e-6
        while True:
            try:
                self.L_mm_ = cholesky(K_mm + jitter * np.eye(m), lower=True)
                break
            except np.linalg.LinAlgError:
                jitter *= 10
                if jitter > 1e-2:
                    raise ValueError("K_mm is severely ill-conditioned")

        # Step 5: Compute Nyström features Phi = K_nm @ L^{-T}
        # Solve L @ Phi^T = K_nm^T  =>  Phi^T = L^{-1} @ K_nm^T
        Phi = solve_triangular(self.L_mm_, K_nm.T, lower=True).T               # (n, m)

        # Step 6: Solve ridge regression
        # (Phi^T Phi + alpha I) w = Phi^T y
        A = Phi.T @ Phi + self.alpha * np.eye(m)
        b = Phi.T @ y
        self.coef_ = solve(A, b, assume_a='pos')

        return self

    def predict(self, X):
        """
        Predict using the fitted model.

        Parameters
        ----------
        X : ndarray of shape (n_samples, n_features)

        Returns
        -------
        y_pred : ndarray of shape (n_samples,)
        """
        # Kernel between test points and landmarks
        K_test_m = rbf_kernel(X, self.landmarks_, gamma=self.gamma)

        # Transform to Nyström features
        Phi_test = solve_triangular(self.L_mm_, K_test_m.T, lower=True).T

        return Phi_test @ self.coef_


# Demonstration
if __name__ == "__main__":
    from sklearn.datasets import make_regression
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import mean_squared_error
    import time

    # Generate data
    X, y = make_regression(n_samples=10000, n_features=20, noise=10, random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

    # Fit Nyström model
    print("Fitting Nyström Ridge Regression...")
    start = time.time()
    model = NystromRidgeRegression(
        n_landmarks=500, gamma=0.01, alpha=1.0,
        landmark_method='random', random_state=42
    )
    model.fit(X_train, y_train)
    fit_time = time.time() - start

    # Predict
    y_pred = model.predict(X_test)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))

    print(f"Training time: {fit_time:.3f}s")
    print(f"Test RMSE: {rmse:.4f}")
    print(f"Landmarks: {model.n_landmarks}, Input dimensions: {X.shape[1]}")
```

Both Nyström and Random Fourier Features approximate kernel methods with $m$-dimensional explicit features. When should you choose one over the other?
Choose Nyström when:
- the kernel is not shift-invariant (polynomial, string, or graph kernels), so RFF does not apply;
- the data has clustered or low-dimensional structure that well-chosen landmarks can exploit;
- a moderate number of features (on the order of thousands) suffices and the $O(nm^2)$ preprocessing is affordable.
Choose RFF when:
- the kernel is shift-invariant (RBF, Laplacian) and data-independent features are acceptable;
- the setting is online or streaming, so the feature map must be fixed before seeing the data;
- you need a very large number of features cheaply, or want to avoid storing landmarks at prediction time.
| Scenario | Recommended Approach | Reason |
|---|---|---|
| RBF kernel, random data | Either | Both work well; RFF simpler |
| RBF kernel, clustered data | Nyström (k-means) | Captures cluster structure |
| Polynomial kernel | Nyström | RFF doesn't apply |
| String/graph kernel | Nyström | RFF requires shift-invariance |
| Online learning | RFF | Data-independent features |
| Need 10,000+ features | RFF | Nyström preprocessing expensive |
| Moderate features (1,000) | Either | Compare empirically |
In practice, you can combine methods: use RFF for additional features once Nyström landmarks are exhausted, or use Nyström on top of RFF-transformed data. Cross-validation is your friend—let the data decide!
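One way to act on the "combine both" suggestion, as a sketch built from scikit-learn's `Nystroem` and `RBFSampler` (RFF) transformers; the component counts and kernel parameters are illustrative:

```python
import numpy as np
from sklearn.kernel_approximation import Nystroem, RBFSampler
from sklearn.linear_model import Ridge
from sklearn.pipeline import FeatureUnion, make_pipeline

rng = np.random.default_rng(0)
X = rng.standard_normal((5000, 20))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(5000)

# Concatenate 200 data-dependent Nyström features with 800 random Fourier features
features = FeatureUnion([
    ("nystroem", Nystroem(kernel="rbf", gamma=0.1, n_components=200, random_state=0)),
    ("rff", RBFSampler(gamma=0.1, n_components=800, random_state=0)),
])
model = make_pipeline(features, Ridge(alpha=1.0))
model.fit(X, y)
print("Training R^2:", model.score(X, y))
```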
The Nyström approximation provides a powerful, versatile approach to scaling kernel methods by exploiting the inherent low-rank structure of kernel matrices.
What's next:
Nyström and RFF are general-purpose approximations that work across kernel methods. Next, we focus specifically on Sparse Gaussian Processes, which combine similar ideas of inducing points with the probabilistic framework of GPs, enabling scalable Bayesian inference with uncertainty quantification.
You now understand the Nyström approximation from theoretical foundations through practical implementation. This technique, combined with RFF, gives you a complete toolkit for scaling kernel methods to datasets that would otherwise be completely intractable.