While Random Fourier Features approximate kernels by mapping data to an explicit feature space, an alternative strategy attacks the problem from a different angle: approximate the kernel matrix itself.
The Nyström approximation is a classical technique from numerical analysis that constructs a low-rank approximation to the $n \times n$ kernel matrix using only a small subset of $m \ll n$ columns. The insight is elegant: if the kernel matrix has rapidly decaying eigenvalues (as most practical kernel matrices do), then most of its information is captured by a low-rank structure.
Named after the Finnish mathematician Evert Johannes Nyström, who developed related techniques for integral equations around 1930, this method has become a cornerstone of scalable kernel learning.
By the end of this page, you will understand how the Nyström method constructs low-rank approximations, the mathematical guarantees on approximation quality, various landmark (inducing point) selection strategies, computational costs and trade-offs, and when to prefer Nyström over Random Fourier Features.
Before diving into the Nyström method, we must understand why low-rank approximations work well for kernel matrices.
Eigenvalue decay:
Kernel matrices arising from smooth kernels (like RBF) on bounded domains exhibit a characteristic property: their eigenvalues decay rapidly. If we compute the eigendecomposition:
$$\mathbf{K} = \sum_{i=1}^{n} \lambda_i \mathbf{u}_i \mathbf{u}_i^T$$
the eigenvalues $\lambda_1 \geq \lambda_2 \geq \ldots \geq \lambda_n \geq 0$ typically follow a power-law or exponential decay:
$$\lambda_i \approx C \cdot i^{-\alpha} \quad \text{or} \quad \lambda_i \approx C \cdot e^{-\beta i}$$
for constants depending on the kernel and data distribution.
The effective rank of a kernel matrix is the number of eigenvalues that contribute meaningfully to the matrix. For RBF kernels on typical datasets, the effective rank is often $O(\sqrt{n})$ or even $O(\log n)$—much smaller than $n$. This is why low-rank approximations can be extremely accurate with $m \ll n$ components.
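A quick empirical check of this claim, as a minimal sketch: build a small RBF kernel matrix on synthetic data (the data, bandwidth `gamma=0.5`, and the 99% threshold are illustrative choices) and count how many eigenvalues are needed to cover most of the trace.

```python
import numpy as np

def rbf(A, B, gamma=0.5):
    """RBF kernel matrix between rows of A and rows of B."""
    d = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1) - 2 * A @ B.T
    return np.exp(-gamma * d)

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 5))      # 500 synthetic points in 5 dimensions
K = rbf(X, X)

# Eigenvalues in decreasing order
eigvals = np.linalg.eigvalsh(K)[::-1]

# Effective rank: how many eigenvalues are needed to cover 99% of the trace
cumulative = np.cumsum(eigvals) / np.sum(eigvals)
m_99 = int(np.searchsorted(cumulative, 0.99)) + 1
print(f"n = {K.shape[0]}, eigenvalues covering 99% of the trace: {m_99}")
print("Top 5 eigenvalues:", np.round(eigvals[:5], 2))
```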
Physical intuition:
Why do kernel matrices have low effective rank? Consider what the kernel measures: similarity between data points. If the data lies on a low-dimensional manifold or has clustered structure, then many points look alike through the lens of the kernel, so their rows in $\mathbf{K}$ are nearly linearly dependent and the matrix carries far fewer than $n$ independent directions of information.
Optimal low-rank approximation:
The best rank-$m$ approximation to any matrix (minimizing Frobenius norm) is given by truncated SVD:
$$\mathbf{K}_m = \sum_{i=1}^{m} \lambda_i \mathbf{u}_i \mathbf{u}_i^T$$
The approximation error is:
$$\|\mathbf{K} - \mathbf{K}_m\|_F^2 = \sum_{i=m+1}^{n} \lambda_i^2$$
With rapid eigenvalue decay, this error is small even for moderate $m$. However, computing the full SVD costs $O(n^3)$—the very cost we're trying to avoid!
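The tail-sum formula is easy to verify numerically. The following sketch (same synthetic setup as above, with an illustrative truncation rank $m = 25$) compares the Frobenius error of the truncated eigendecomposition against $\sqrt{\sum_{i>m} \lambda_i^2}$; the two agree up to roundoff.

```python
import numpy as np

def rbf(A, B, gamma=0.5):
    d = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1) - 2 * A @ B.T
    return np.exp(-gamma * d)

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 5))
K = rbf(X, X)

# Eigendecomposition with eigenvalues in decreasing order
w, V = np.linalg.eigh(K)
w, V = w[::-1], V[:, ::-1]

m = 25
K_m = V[:, :m] @ np.diag(w[:m]) @ V[:, :m].T       # best rank-m approximation

frob_error = np.linalg.norm(K - K_m, "fro")
tail_error = np.sqrt(np.sum(w[m:] ** 2))           # sqrt of sum of squared tail eigenvalues
print(f"||K - K_m||_F = {frob_error:.6f}")
print(f"tail formula  = {tail_error:.6f}")
```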
| Eigenvalue Index | Relative Magnitude | Cumulative % of Trace |
|---|---|---|
| 1-10 | Large (dominant) | ~50-70% |
| 11-100 | Moderate | ~85-95% |
| 101-1000 | Small | ~99%+ |
| 1001+ | Negligible | Remainder < 1% |
The Nyström method constructs a low-rank approximation to $\mathbf{K}$ using only a subset of $m$ columns, without computing the full eigendecomposition.
The setup:
Select $m$ landmark points (also called inducing points or Nyström samples) from the training set. Without loss of generality, assume these are the first $m$ points. Partition the kernel matrix as:
$$\mathbf{K} = \begin{bmatrix} \mathbf{K}_{mm} & \mathbf{K}_{mn} \\ \mathbf{K}_{nm} & \mathbf{K}_{nn} \end{bmatrix}$$
where:
- $\mathbf{K}_{mm} \in \mathbb{R}^{m \times m}$ contains kernel values among the landmarks,
- $\mathbf{K}_{nm} = \mathbf{K}_{mn}^T \in \mathbb{R}^{(n-m) \times m}$ contains kernel values between the remaining points and the landmarks,
- $\mathbf{K}_{nn} \in \mathbb{R}^{(n-m) \times (n-m)}$ contains kernel values among the remaining points—the large block we want to avoid computing.
The approximation:
The Nyström approximation estimates the full kernel matrix as:
$$\tilde{\mathbf{K}} = \mathbf{K}_{:m} \mathbf{K}_{mm}^{-1} \mathbf{K}_{m:}$$
where $\mathbf{K}_{:m} \in \mathbb{R}^{n \times m}$ contains the first $m$ columns of $\mathbf{K}$, and $\mathbf{K}_{m:} = \mathbf{K}_{:m}^T$.
The Nyström approximation only requires computing $\mathbf{K}_{:m}$ (all $n$ points against $m$ landmarks) and $\mathbf{K}_{mm}$ (landmarks against landmarks). Total kernel evaluations: $nm + m^2/2 \approx nm$ instead of $n^2/2$. For $m \ll n$, this is a massive saving!
Why does this work?
If the kernel matrix were exactly rank-$m$ with the top $m$ eigenvectors aligned with the landmark columns, the Nyström approximation would be exact. For kernel matrices with rapid eigenvalue decay, the landmarks (if chosen well) capture most of the column space of $\mathbf{K}$.
Mathematically, the Nyström approximation can be viewed as projecting each point's (implicit) feature map onto the subspace spanned by the landmarks' feature maps and taking inner products in that subspace; if the landmarks span the relevant part of feature space, little is lost.
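The following sketch makes this concrete: it forms $\tilde{\mathbf{K}} = \mathbf{K}_{:m}\mathbf{K}_{mm}^{-1}\mathbf{K}_{m:}$ from randomly chosen landmark columns and compares its error to the optimal rank-$m$ truncation. The data, bandwidth, and $m$ are illustrative choices, and a pseudo-inverse stands in for $\mathbf{K}_{mm}^{-1}$ for numerical safety.

```python
import numpy as np

def rbf(A, B, gamma=0.1):
    d = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1) - 2 * A @ B.T
    return np.exp(-gamma * d)

rng = np.random.default_rng(1)
X = rng.standard_normal((800, 10))
K = rbf(X, X)

m = 50
idx = rng.choice(len(X), size=m, replace=False)       # random landmarks
K_nm = K[:, idx]                                      # n x m landmark columns
K_mm = K[np.ix_(idx, idx)]                            # m x m landmark block

# Nyström approximation (pseudo-inverse for numerical stability)
K_tilde = K_nm @ np.linalg.pinv(K_mm) @ K_nm.T

# Optimal rank-m approximation via eigendecomposition, for reference
w, V = np.linalg.eigh(K)
w, V = w[::-1], V[:, ::-1]
K_best = V[:, :m] @ np.diag(w[:m]) @ V[:, :m].T

print("Nyström rank-m error :", np.linalg.norm(K - K_tilde, "fro"))
print("Optimal rank-m error :", np.linalg.norm(K - K_best, "fro"))
```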
The factored form:
For efficient computation, we never form the $n \times n$ matrix $\tilde{\mathbf{K}}$ explicitly. Instead, we use:
$$\tilde{\mathbf{K}} = \mathbf{K}_{:m} \mathbf{K}_{mm}^{-1} \mathbf{K}_{m:} = \boldsymbol{\Phi} \boldsymbol{\Phi}^T$$
where $\boldsymbol{\Phi} = \mathbf{K}_{:m} \mathbf{K}_{mm}^{-1/2} \in \mathbb{R}^{n \times m}$ is an explicit feature matrix.
This shows that Nyström, like RFF, provides an explicit $m$-dimensional feature representation!
| Matrix | Size | Cost to Compute | Storage |
|---|---|---|---|
| $\mathbf{K}_{:m}$ (landmark columns) | $n \times m$ | $O(nm \cdot d)$ | $O(nm)$ |
| $\mathbf{K}_{mm}$ (landmark kernel) | $m \times m$ | $O(m^2 \cdot d)$ | $O(m^2)$ |
| $\mathbf{K}_{mm}^{-1/2}$ (Cholesky) | $m \times m$ | $O(m^3)$ | $O(m^2)$ |
| $\boldsymbol{\Phi}$ (features) | $n \times m$ | $O(nm^2)$ | $O(nm)$ |
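In practice you rarely need to assemble $\boldsymbol{\Phi}$ by hand: scikit-learn ships this exact feature map as the `Nystroem` transformer. A minimal sketch, with illustrative data and parameter values:

```python
import numpy as np
from sklearn.kernel_approximation import Nystroem
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.standard_normal((5000, 20))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(5000)

# m = 300 landmarks -> explicit 300-dimensional Nyström features, then linear ridge
model = make_pipeline(
    Nystroem(kernel="rbf", gamma=0.1, n_components=300, random_state=0),
    Ridge(alpha=1.0),
)
model.fit(X, y)
print("Training R^2:", model.score(X, y))
```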
Let's analyze the complete computational cost of using Nyström approximation for kernel ridge regression.
Step 1: Form the Nyström decomposition
- Compute $\mathbf{K}_{:m}$ (all points vs. landmarks): $O(nmd)$
- Compute and factor $\mathbf{K}_{mm}$ (Cholesky): $O(m^2 d + m^3)$
- Form the features $\boldsymbol{\Phi} = \mathbf{K}_{:m}\mathbf{K}_{mm}^{-1/2}$: $O(nm^2)$
Total for decomposition: $O(nm(d + m) + m^3)$
Step 2: Solve ridge regression
With explicit features $\boldsymbol{\Phi}$, ridge regression becomes:
$$(\boldsymbol{\Phi}^T \boldsymbol{\Phi} + \lambda \mathbf{I})\mathbf{w} = \boldsymbol{\Phi}^T \mathbf{y}$$
Forming $\boldsymbol{\Phi}^T \boldsymbol{\Phi}$ costs $O(nm^2)$ and solving the resulting $m \times m$ system costs $O(m^3)$. Total for regression: $O(nm^2 + m^3)$
| Operation | Exact Kernel Method | Nyström ($m$ landmarks) |
|---|---|---|
| Kernel computation | $O(n^2 d)$ | $O(nmd)$ |
| Training | $O(n^3)$ | $O(nm^2 + m^3)$ |
| Memory | $O(n^2)$ | $O(nm)$ |
| Prediction (per point) | $O(nd)$ | $O(md + m)$ |
| Total Training | $O(n^2 d + n^3)$ | $O(nmd + nm^2 + m^3)$ |
For $m = \sqrt{n}$: training becomes $O(n^2)$ instead of $O(n^3)$. For $m = O(1)$ constant: training becomes $O(n)$ linear! The practical sweet spot is usually $m \approx 1,000-10,000$, giving excellent approximation quality with tractable computation for millions of points.
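As a rough back-of-the-envelope illustration (constants ignored): with $n = 10^6$ points, $d = 100$ features, and $m = 10^3$ landmarks, the exact approach needs on the order of $n^2 d / 2 \approx 5 \times 10^{13}$ kernel evaluations, an $O(n^3) = 10^{18}$-operation solve, and an $n \times n$ matrix occupying roughly 8 TB in double precision. Nyström needs about $nmd = 10^{11}$ kernel evaluations, an $O(nm^2) = 10^{12}$-operation solve, and roughly $nm \cdot 8$ bytes $\approx 8$ GB of feature storage.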
Comparison with Random Fourier Features:
Both methods produce $m$-dimensional explicit features. The key differences:
| Aspect | Nyström | Random Fourier Features |
|---|---|---|
| Basis functions | Data-dependent (landmarks) | Data-independent (random frequencies) |
| Feature computation | $O(md)$ per point | $O(md)$ per point |
| Preprocessing | $O(nm^2 + m^3)$ (factor $\mathbf{K}_{mm}$, form features) | $O(md)$ (sample frequencies) |
| Kernel dependence | Any kernel | Shift-invariant kernels only |
| Approximation adapts to | Data distribution | Kernel smoothness |
| Out-of-sample extension | Requires landmark storage | Only frequencies stored |
Key insight: Nyström adapts to the data (landmarks are data points), potentially capturing structure that random features miss. RFF is more general (works for any dataset with the same kernel) but may need more features for structured data.
The quality of the Nyström approximation depends critically on how landmarks are selected. Poor landmark choices lead to poor approximations regardless of $m$. Several strategies have been developed:
1. Uniform Random Sampling
The simplest approach: randomly sample $m$ points without replacement.
2. K-means Clustering Centers
Run k-means with $k = m$ clusters, use cluster centers as landmarks.
3. Leverage Score Sampling
Sample points with probability proportional to their statistical leverage, i.e., how much they contribute to the column space of $\mathbf{K}$. A standard choice is the ridge leverage score:
$$p_i \propto \ell_i(\lambda) = \left[\mathbf{K}(\mathbf{K} + \lambda \mathbf{I})^{-1}\right]_{ii}$$
Exact scores cost $O(n^3)$, but approximate leverage scores can be computed efficiently; a small exact-score sketch appears after the strategy list below.
| Method | Selection Cost | Approximation Quality | Ease of Implementation |
|---|---|---|---|
| Uniform Random | $O(n)$ | Moderate | Trivial |
| K-means Centers | $O(nmd \cdot T)$ | Good for clustered data | Easy (use library) |
| Greedy (MaxVol) | $O(nm^2)$ | Near-optimal | Complex |
| Leverage Scores | $O(nm^2)$ approx | Theoretical guarantees | Moderate |
| Determinantal Point Process | $O(n^3)$ exact, $O(nm^2)$ approx | Optimal diversity | Complex |
4. Greedy Selection (MaxVol)
Iteratively select the point that maximizes the volume of the selected set:
1. Initialize with random point or centroid
2. For i = 2 to m:
- Find the point whose addition maximizes det(K_{selected ∪ {new point}})
- Add to selected set
This greedily builds a set with maximum "spread" in the feature space.
5. Determinantal Point Processes (DPP)
Sample landmarks from a DPP, which naturally encourages diversity:
$$P(S) \propto \det(\mathbf{K}_S)$$
Points that are far apart (low kernel values) are more likely to be jointly selected.
For most applications, start with uniform random sampling—it's simple and often sufficient. If results are unsatisfactory, try k-means centers. Reserve leverage score or DPP sampling for cases where you need provable approximation guarantees or have very structured data.
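As a concrete illustration of leverage-score sampling (strategy 3 above), here is a minimal sketch using exact ridge leverage scores $\ell_i(\lambda) = [\mathbf{K}(\mathbf{K} + \lambda\mathbf{I})^{-1}]_{ii}$. It forms the full kernel matrix, so it is only feasible for small $n$; the bandwidth and regularization values are illustrative.

```python
import numpy as np

def ridge_leverage_landmarks(K, m, lam=1e-2, random_state=None):
    """Sample m landmark indices with probability proportional to the
    ridge leverage scores l_i = [K (K + lam*I)^{-1}]_{ii}."""
    rng = np.random.RandomState(random_state)
    n = K.shape[0]
    # Exact ridge leverage scores: O(n^3), use approximations for large n
    scores = np.diag(K @ np.linalg.inv(K + lam * np.eye(n)))
    p = scores / scores.sum()
    return rng.choice(n, size=m, replace=False, p=p)

# Example with a small RBF kernel matrix
rng = np.random.default_rng(0)
X = rng.standard_normal((400, 8))
sq = np.sum(X**2, axis=1)[:, None] + np.sum(X**2, axis=1) - 2 * X @ X.T
K = np.exp(-0.25 * sq)
landmarks = ridge_leverage_landmarks(K, m=40, random_state=0)
print(landmarks[:10])
```

The selection code below implements the random, k-means, and greedy strategies.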
```python
import numpy as np
from sklearn.cluster import KMeans


def select_landmarks_random(X, m, random_state=None):
    """
    Uniform random landmark selection.

    Parameters
    ----------
    X : ndarray of shape (n, d)
    m : int, number of landmarks
    random_state : int

    Returns
    -------
    indices : ndarray of shape (m,)
        Indices of selected landmarks
    """
    rng = np.random.RandomState(random_state)
    n = X.shape[0]
    return rng.choice(n, size=m, replace=False)


def select_landmarks_kmeans(X, m, random_state=None):
    """
    K-means based landmark selection.
    Returns cluster centers (not original points).
    """
    kmeans = KMeans(n_clusters=m, random_state=random_state, n_init=10)
    kmeans.fit(X)
    return kmeans.cluster_centers_  # Shape (m, d)


def select_landmarks_greedy(X, m, kernel_func, random_state=None):
    """
    Greedy landmark selection maximizing the kernel matrix determinant.

    Parameters
    ----------
    X : ndarray of shape (n, d)
    m : int, number of landmarks
    kernel_func : callable, kernel function k(X1, X2)

    Returns
    -------
    indices : ndarray of shape (m,)
        Indices of selected landmarks
    """
    rng = np.random.RandomState(random_state)
    n = X.shape[0]

    # Start with a random point
    selected = [rng.randint(n)]
    remaining = set(range(n)) - set(selected)

    for _ in range(m - 1):
        best_score = -np.inf
        best_idx = None

        # Current kernel matrix among selected points
        X_selected = X[selected]
        K_selected = kernel_func(X_selected, X_selected)

        # Try adding each remaining point
        for idx in remaining:
            # Kernel between the candidate and the selected set
            k_new = kernel_func(X[[idx]], X_selected).flatten()  # (len(selected),)
            k_self = kernel_func(X[[idx]], X[[idx]])[0, 0]

            # Score via the Schur complement:
            # det([K, k; k^T, k_self]) = det(K) * (k_self - k^T K^{-1} k),
            # so the candidate with the largest Schur complement maximizes
            # the determinant of the augmented matrix.
            try:
                v = np.linalg.solve(K_selected + 1e-6 * np.eye(len(selected)), k_new)
                schur = k_self - np.dot(k_new, v)
                if schur > 0:
                    score = np.log(schur)
                    if score > best_score:
                        best_score = score
                        best_idx = idx
            except np.linalg.LinAlgError:
                continue

        if best_idx is not None:
            selected.append(best_idx)
            remaining.remove(best_idx)
        else:
            # Fall back to a random choice if the greedy step fails
            best_idx = rng.choice(list(remaining))
            selected.append(best_idx)
            remaining.remove(best_idx)

    return np.array(selected)
```

Understanding when and how well the Nyström approximation works requires examining its error bounds.
Matrix approximation error:
The Frobenius norm error of the Nyström approximation is:
$$\|\mathbf{K} - \tilde{\mathbf{K}}\|_F = \|\mathbf{K} - \mathbf{K}_{:m}\mathbf{K}_{mm}^{-1}\mathbf{K}_{m:}\|_F$$
For uniform random sampling with $m$ landmarks, with probability at least $1 - \delta$:
$$\|\mathbf{K} - \tilde{\mathbf{K}}\|_F \leq \|\mathbf{K} - \mathbf{K}_k\|_F + O\left(\frac{n}{\sqrt{m}} \cdot \sqrt{\log(1/\delta)}\right)$$
where $\mathbf{K}_k$ is the best rank-$k$ approximation. The first term is unavoidable (SVD error), and the second vanishes as $m \to \infty$.
The approximation error depends heavily on eigenvalue decay. If $\lambda_i \sim i^{-\alpha}$ (polynomial decay), then the error scales as $O(m^{-(\alpha-1)/2})$. If $\lambda_i \sim e^{-\beta i}$ (exponential decay), error decreases exponentially in $m$. Smoother kernels = faster decay = better Nyström performance.
Leverage score sampling guarantees:
With leverage score sampling, stronger guarantees hold:
$$\|\mathbf{K} - \tilde{\mathbf{K}}\|_F \leq (1 + \epsilon) \|\mathbf{K} - \mathbf{K}_k\|_F$$
with $m = O(k/\epsilon)$ landmarks. This is near-optimal—matching SVD quality with only a $(1+\epsilon)$ factor!
Prediction error analysis:
More relevant than matrix approximation is how error affects learning. For kernel ridge regression:
$$\|\hat{f}_{\text{Nyström}} - \hat{f}_{\text{exact}}\|_{L^2} \leq \frac{\|\mathbf{K} - \tilde{\mathbf{K}}\|}{\lambda}$$
The regularization parameter $\lambda$ appears in the denominator—stronger regularization reduces sensitivity to approximation error.
| Setting | Error Bound | Rate |
|---|---|---|
| Uniform sampling | $O(n/\sqrt{m})$ | Slow |
| Leverage score sampling | $(1+\epsilon) \cdot \text{optimal}$ | Near-optimal |
| K-means landmarks | Data-dependent, no worst-case guarantee | Often good in practice |
| Greedy selection | Typically $O(1/m)$ in practice | Fast but no formal bound |
Practical quality assessment:
Rather than relying on theoretical bounds, practitioners often assess Nyström quality empirically: measure $\|\mathbf{K} - \tilde{\mathbf{K}}\|_F$ on a random subsample of points, or track validation error as $m$ increases (see the sketch below).
Typically, the approximation saturates at some $m^*$—additional landmarks provide diminishing returns.
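A minimal sketch of this check (illustrative data, bandwidth, and subsample size): estimate the relative Frobenius error of the Nyström approximation on a random subsample of points for increasing $m$, and stop when the curve flattens.

```python
import numpy as np

def rbf(A, B, gamma=0.1):
    d = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1) - 2 * A @ B.T
    return np.exp(-gamma * d)

rng = np.random.default_rng(0)
X = rng.standard_normal((3000, 15))

# Measure error on a random subsample (the full K would be 3000 x 3000)
sub = rng.choice(len(X), size=400, replace=False)
K_sub = rbf(X[sub], X[sub])

for m in [25, 50, 100, 200]:
    idx = rng.choice(len(X), size=m, replace=False)   # landmarks from the full set
    C = rbf(X[sub], X[idx])                           # subsample-to-landmark block
    W = rbf(X[idx], X[idx])                           # landmark-to-landmark block
    K_tilde = C @ np.linalg.pinv(W) @ C.T             # Nyström estimate of K_sub
    rel_err = np.linalg.norm(K_sub - K_tilde, "fro") / np.linalg.norm(K_sub, "fro")
    print(f"m = {m:4d}: relative Frobenius error = {rel_err:.4f}")
```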
Nyström can fail badly if landmarks miss important regions. Example: data has two well-separated clusters, all landmarks come from one cluster. The approximation will be poor for between-cluster kernel entries. K-means or stratified sampling mitigates this risk.
Let's implement a complete Nyström approximation for kernel ridge regression.
The algorithm:
1. Select $m$ landmarks (random or k-means).
2. Compute $\mathbf{K}_{nm}$ (all points vs. landmarks) and $\mathbf{K}_{mm}$ (landmarks vs. landmarks).
3. Cholesky-factor $\mathbf{K}_{mm} = \mathbf{L}\mathbf{L}^T$ (with jitter for numerical stability).
4. Form the features $\boldsymbol{\Phi} = \mathbf{K}_{nm}\mathbf{L}^{-T}$.
5. Solve the ridge system $(\boldsymbol{\Phi}^T\boldsymbol{\Phi} + \alpha\mathbf{I})\mathbf{w} = \boldsymbol{\Phi}^T\mathbf{y}$.
6. Predict with $\hat{y} = \boldsymbol{\phi}(\mathbf{x})^T\mathbf{w}$, where $\boldsymbol{\phi}(\mathbf{x}) = \mathbf{L}^{-1}\mathbf{k}_m(\mathbf{x})$ and $\mathbf{k}_m(\mathbf{x})$ holds the kernel values between $\mathbf{x}$ and the landmarks.
```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular, solve


def rbf_kernel(X1, X2, gamma=1.0):
    """Compute RBF kernel matrix between X1 and X2."""
    sq_dist = (
        np.sum(X1**2, axis=1, keepdims=True)
        + np.sum(X2**2, axis=1)
        - 2 * X1 @ X2.T
    )
    return np.exp(-gamma * sq_dist)


class NystromRidgeRegression:
    """
    Kernel Ridge Regression with Nyström approximation.

    Parameters
    ----------
    n_landmarks : int
        Number of landmark points.
    gamma : float
        RBF kernel bandwidth.
    alpha : float
        Ridge regularization strength.
    landmark_method : str
        'random' or 'kmeans'
    random_state : int
        Random seed.
    """

    def __init__(self, n_landmarks=100, gamma=1.0, alpha=1.0,
                 landmark_method='random', random_state=None):
        self.n_landmarks = n_landmarks
        self.gamma = gamma
        self.alpha = alpha
        self.landmark_method = landmark_method
        self.random_state = random_state

        # Fitted attributes
        self.landmarks_ = None
        self.L_mm_ = None   # Cholesky factor of K_mm
        self.coef_ = None   # Regression coefficients

    def _select_landmarks(self, X):
        """Select landmark points."""
        rng = np.random.RandomState(self.random_state)
        n = X.shape[0]
        m = min(self.n_landmarks, n)

        if self.landmark_method == 'random':
            indices = rng.choice(n, size=m, replace=False)
            return X[indices].copy()
        elif self.landmark_method == 'kmeans':
            from sklearn.cluster import KMeans
            km = KMeans(n_clusters=m, random_state=self.random_state, n_init=10)
            km.fit(X)
            return km.cluster_centers_
        else:
            raise ValueError(f"Unknown landmark method: {self.landmark_method}")

    def fit(self, X, y):
        """
        Fit Nyström ridge regression.

        Parameters
        ----------
        X : ndarray of shape (n_samples, n_features)
        y : ndarray of shape (n_samples,)
        """
        # Step 1: Select landmarks
        self.landmarks_ = self._select_landmarks(X)
        m = self.landmarks_.shape[0]

        # Step 2: Compute K_nm (all points to landmarks)
        K_nm = rbf_kernel(X, self.landmarks_, gamma=self.gamma)                # (n, m)

        # Step 3: Compute K_mm (landmarks to landmarks)
        K_mm = rbf_kernel(self.landmarks_, self.landmarks_, gamma=self.gamma)  # (m, m)

        # Step 4: Cholesky decomposition (with jitter for stability)
        jitter = 1e-6
        while True:
            try:
                self.L_mm_ = cholesky(K_mm + jitter * np.eye(m), lower=True)
                break
            except np.linalg.LinAlgError:
                jitter *= 10
                if jitter > 1e-2:
                    raise ValueError("K_mm is severely ill-conditioned")

        # Step 5: Compute Nyström features Phi = K_nm @ L^{-T}
        # Solve L @ Phi^T = K_nm^T  =>  Phi^T = L^{-1} @ K_nm^T
        Phi = solve_triangular(self.L_mm_, K_nm.T, lower=True).T               # (n, m)

        # Step 6: Solve ridge regression
        # (Phi^T Phi + alpha I) w = Phi^T y
        A = Phi.T @ Phi + self.alpha * np.eye(m)
        b = Phi.T @ y
        self.coef_ = solve(A, b, assume_a='pos')

        return self

    def predict(self, X):
        """
        Predict using the fitted model.

        Parameters
        ----------
        X : ndarray of shape (n_samples, n_features)

        Returns
        -------
        y_pred : ndarray of shape (n_samples,)
        """
        # Kernel between test points and landmarks
        K_test_m = rbf_kernel(X, self.landmarks_, gamma=self.gamma)

        # Transform to Nyström features
        Phi_test = solve_triangular(self.L_mm_, K_test_m.T, lower=True).T

        return Phi_test @ self.coef_


# Demonstration
if __name__ == "__main__":
    from sklearn.datasets import make_regression
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import mean_squared_error
    import time

    # Generate data
    X, y = make_regression(n_samples=10000, n_features=20, noise=10, random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

    # Fit Nyström model
    print("Fitting Nyström Ridge Regression...")
    start = time.time()
    model = NystromRidgeRegression(
        n_landmarks=500, gamma=0.01, alpha=1.0,
        landmark_method='random', random_state=42
    )
    model.fit(X_train, y_train)
    fit_time = time.time() - start

    # Predict
    y_pred = model.predict(X_test)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))

    print(f"Training time: {fit_time:.3f}s")
    print(f"Test RMSE: {rmse:.4f}")
    print(f"Landmarks: {model.n_landmarks}, Input dimensions: {X.shape[1]}")
```

Both Nyström and Random Fourier Features approximate kernel methods with $m$-dimensional explicit features. When should you choose one over the other?
Choose Nyström when:
- the kernel is not shift-invariant (polynomial, string, or graph kernels), so RFF does not apply;
- the data has clustered or low-dimensional structure that well-chosen landmarks can exploit;
- a moderate number of features (on the order of thousands) suffices and the $O(nm^2)$ preprocessing is affordable.
Choose RFF when:
- the kernel is shift-invariant (RBF, Laplacian) and data-independent features are acceptable;
- the setting is online or streaming, so the feature map must be fixed before seeing the data;
- you need a very large number of features cheaply, or want to avoid storing landmarks at prediction time.
| Scenario | Recommended Approach | Reason |
|---|---|---|
| RBF kernel, random data | Either | Both work well; RFF simpler |
| RBF kernel, clustered data | Nyström (k-means) | Captures cluster structure |
| Polynomial kernel | Nyström | RFF doesn't apply |
| String/graph kernel | Nyström | RFF requires shift-invariance |
| Online learning | RFF | Data-independent features |
| Need 10,000+ features | RFF | Nyström preprocessing expensive |
| Moderate features (1,000) | Either | Compare empirically |
In practice, you can combine methods: use RFF for additional features once Nyström landmarks are exhausted, or use Nyström on top of RFF-transformed data. Cross-validation is your friend—let the data decide!
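One way to act on the "combine both" suggestion, as a sketch built from scikit-learn's `Nystroem` and `RBFSampler` (RFF) transformers; the component counts and kernel parameters are illustrative:

```python
import numpy as np
from sklearn.kernel_approximation import Nystroem, RBFSampler
from sklearn.linear_model import Ridge
from sklearn.pipeline import FeatureUnion, make_pipeline

rng = np.random.default_rng(0)
X = rng.standard_normal((5000, 20))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(5000)

# Concatenate 200 data-dependent Nyström features with 800 random Fourier features
features = FeatureUnion([
    ("nystroem", Nystroem(kernel="rbf", gamma=0.1, n_components=200, random_state=0)),
    ("rff", RBFSampler(gamma=0.1, n_components=800, random_state=0)),
])
model = make_pipeline(features, Ridge(alpha=1.0))
model.fit(X, y)
print("Training R^2:", model.score(X, y))
```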
The Nyström approximation provides a powerful, versatile approach to scaling kernel methods by exploiting the inherent low-rank structure of kernel matrices.
What's next:
Nyström and RFF are general-purpose approximations that work across kernel methods. Next, we focus specifically on Sparse Gaussian Processes, which combine similar ideas of inducing points with the probabilistic framework of GPs, enabling scalable Bayesian inference with uncertainty quantification.
You now understand the Nyström approximation from theoretical foundations through practical implementation. This technique, combined with RFF, gives you a complete toolkit for scaling kernel methods to datasets that would otherwise be completely intractable.