In 2007, Ali Rahimi and Benjamin Recht published a paper that fundamentally changed how we think about kernel methods. Their key insight was elegant and surprising:
Shift-invariant kernels can be approximated by explicit, finite-dimensional random feature mappings.
This means that instead of computing the $n \times n$ kernel matrix—with its quadratic memory and cubic training costs—we can transform our data into a $D$-dimensional feature space and use standard linear methods. The computational complexity drops from $O(n^3)$ to $O(nD^2)$, and with $D \ll n$, kernel methods become feasible for datasets of virtually unlimited size.
This technique, known as Random Fourier Features (RFF), bridges the gap between kernel methods and linear models, combining the expressiveness of the former with the scalability of the latter.
By the end of this page, you will understand Bochner's theorem and its connection to kernels, how to derive the Random Fourier Feature mapping for RBF kernels, the mathematical guarantees on approximation quality, practical implementation considerations, and extensions to other kernel types.
The mathematical foundation of Random Fourier Features is Bochner's theorem, a classical result from harmonic analysis that connects positive definite functions to Fourier transforms.
Shift-invariant kernels:
A kernel $k(\mathbf{x}, \mathbf{x}')$ is shift-invariant (or stationary) if it depends only on the difference $\boldsymbol{\delta} = \mathbf{x} - \mathbf{x}'$:
$$k(\mathbf{x}, \mathbf{x}') = k(\mathbf{x} - \mathbf{x}') = k(\boldsymbol{\delta})$$
Examples of shift-invariant kernels include the RBF (Gaussian), Laplacian, Matérn, and Cauchy kernels, all of which appear in the spectral-distribution table below.
Crucially, the polynomial and linear kernels are not shift-invariant—they depend on absolute positions, not just differences.
A continuous shift-invariant kernel $k(\boldsymbol{\delta})$ is positive definite if and only if it is the Fourier transform of a non-negative measure $p(\boldsymbol{\omega})$:
$$k(\boldsymbol{\delta}) = \int_{\mathbb{R}^d} p(\boldsymbol{\omega}) e^{i \boldsymbol{\omega}^T \boldsymbol{\delta}} d\boldsymbol{\omega}$$
If $k(\mathbf{0}) = 1$ (kernel normalized), then $p(\boldsymbol{\omega})$ is a proper probability distribution.
Understanding the theorem:
Bochner's theorem tells us that every valid shift-invariant kernel has a spectral representation—it can be written as an expectation over random frequencies:
$$k(\mathbf{x} - \mathbf{x}') = \mathbb{E}_{\boldsymbol{\omega} \sim p} \left[ e^{i \boldsymbol{\omega}^T (\mathbf{x} - \mathbf{x}')} \right]$$
Since the kernel is real-valued, we can use the real part:
$$k(\mathbf{x} - \mathbf{x}') = \mathbb{E}_{\boldsymbol{\omega} \sim p} \left[ \cos(\boldsymbol{\omega}^T (\mathbf{x} - \mathbf{x}')) \right]$$
Using the trigonometric identity $\cos(a - b) = \cos(a)\cos(b) + \sin(a)\sin(b)$:
$$k(\mathbf{x} - \mathbf{x}') = \mathbb{E}_{\boldsymbol{\omega} \sim p} \left[ \cos(\boldsymbol{\omega}^T \mathbf{x})\cos(\boldsymbol{\omega}^T \mathbf{x}') + \sin(\boldsymbol{\omega}^T \mathbf{x})\sin(\boldsymbol{\omega}^T \mathbf{x}') \right]$$
This is an inner product! Define the random feature map:
$$\phi_{\boldsymbol{\omega}}(\mathbf{x}) = \begin{bmatrix} \cos(\boldsymbol{\omega}^T \mathbf{x}) \\ \sin(\boldsymbol{\omega}^T \mathbf{x}) \end{bmatrix}$$
Then: $k(\mathbf{x}, \mathbf{x}') = \mathbb{E}_{\boldsymbol{\omega}} [\phi_{\boldsymbol{\omega}}(\mathbf{x})^T \phi_{\boldsymbol{\omega}}(\mathbf{x}')]$
| Kernel | Formula | Spectral Distribution $p(\boldsymbol{\omega})$ |
|---|---|---|
| RBF (Gaussian) | $\exp(-\gamma\lVert\boldsymbol{\delta}\rVert^2)$ | $\mathcal{N}(\mathbf{0}, 2\gamma \mathbf{I})$ — Gaussian with variance $2\gamma$ |
| Laplacian | $\exp(-\gamma\lVert\boldsymbol{\delta}\rVert_1)$ | Product of Cauchy distributions with scale $\gamma$ |
| Matérn ($\nu = 1/2$) | $\exp(-\gamma\lVert\boldsymbol{\delta}\rVert)$ | Student-t with appropriate degrees of freedom |
| Cauchy | $\prod_j \frac{1}{1+\gamma^2 \delta_j^2}$ | Product of Laplace distributions |
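As a quick numerical check on the RBF entry above, the short sketch below (with arbitrary test vectors) compares a Monte Carlo estimate of $\mathbb{E}_{\boldsymbol{\omega}}[\cos(\boldsymbol{\omega}^T \boldsymbol{\delta})]$, with $\boldsymbol{\omega} \sim \mathcal{N}(\mathbf{0}, 2\gamma \mathbf{I})$, against the exact kernel value $\exp(-\gamma \lVert\boldsymbol{\delta}\rVert^2)$:

```python
import numpy as np

rng = np.random.default_rng(0)
d, gamma = 5, 0.5
x, x_prime = rng.normal(size=d), rng.normal(size=d)
delta = x - x_prime

# Frequencies from the RBF spectral distribution N(0, 2*gamma*I)
omega = rng.normal(scale=np.sqrt(2 * gamma), size=(1_000_000, d))

mc_estimate = np.cos(omega @ delta).mean()      # E_omega[cos(omega^T delta)]
exact = np.exp(-gamma * np.sum(delta ** 2))     # exp(-gamma * ||delta||^2)
print(mc_estimate, exact)                       # should agree to ~3 decimal places
```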
The key insight of Random Fourier Features is to approximate the expectation with a Monte Carlo average. Instead of integrating over all possible frequencies, we sample a finite set and average:
$$k(\mathbf{x}, \mathbf{x}') \approx \hat{k}(\mathbf{x}, \mathbf{x}') = \frac{1}{D} \sum_{j=1}^{D} \phi_{\boldsymbol{\omega}_j}(\mathbf{x})^T \phi_{\boldsymbol{\omega}_j}(\mathbf{x}')$$
where $\boldsymbol{\omega}_1, \ldots, \boldsymbol{\omega}_D \overset{\text{i.i.d.}}{\sim} p(\boldsymbol{\omega})$.
The explicit feature map:
We can rewrite this as an inner product of finite-dimensional features:
$$\hat{k}(\mathbf{x}, \mathbf{x}') = \boldsymbol{\phi}(\mathbf{x})^T \boldsymbol{\phi}(\mathbf{x}')$$
where the Random Fourier Feature map $\boldsymbol{\phi}: \mathbb{R}^d \to \mathbb{R}^{2D}$ is:
$$\boldsymbol{\phi}(\mathbf{x}) = \sqrt{\frac{1}{D}} \begin{bmatrix} \cos(\boldsymbol{\omega}_1^T \mathbf{x}) \\ \sin(\boldsymbol{\omega}_1^T \mathbf{x}) \\ \vdots \\ \cos(\boldsymbol{\omega}_D^T \mathbf{x}) \\ \sin(\boldsymbol{\omega}_D^T \mathbf{x}) \end{bmatrix}$$
Random Fourier Features transform a kernel machine into a linear model! After computing $\boldsymbol{\phi}(\mathbf{x})$ for each training point, we have an explicit feature matrix $\boldsymbol{\Phi} \in \mathbb{R}^{n \times 2D}$, and we can use standard linear regression or classification methods.
Alternative formulation with random offsets:
A slightly different but equivalent formulation uses random phase offsets instead of paired sine/cosine:
$$\phi_j(\mathbf{x}) = \sqrt{\frac{2}{D}} \cos(\boldsymbol{\omega}_j^T \mathbf{x} + b_j)$$
where $b_j \sim \text{Uniform}[0, 2\pi]$. This gives $D$-dimensional features (rather than $2D$) with comparable approximation quality.
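To see why the offset formulation reproduces the same kernel, note that averaging over the uniform phase gives $\mathbb{E}_b[2\cos(a + b)\cos(a' + b)] = \cos(a - a')$. A tiny numerical check, with arbitrary values standing in for $\boldsymbol{\omega}^T\mathbf{x}$ and $\boldsymbol{\omega}^T\mathbf{x}'$:

```python
import numpy as np

rng = np.random.default_rng(0)
a, a_prime = 1.3, -0.4                     # stand-ins for w^T x and w^T x'
b = rng.uniform(0, 2 * np.pi, size=2_000_000)

lhs = (2 * np.cos(a + b) * np.cos(a_prime + b)).mean()  # E_b[2 cos(a+b) cos(a'+b)]
rhs = np.cos(a - a_prime)
print(lhs, rhs)  # both ≈ cos(a - a'), the same quantity the paired cos/sin map yields
```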
The complete algorithm:
1. Sample frequencies $\boldsymbol{\omega}_1, \ldots, \boldsymbol{\omega}_D \overset{\text{i.i.d.}}{\sim} p(\boldsymbol{\omega})$.
2. Sample offsets $b_1, \ldots, b_D \sim \text{Uniform}[0, 2\pi]$ (if using the offset formulation).
3. Compute the feature matrix $\boldsymbol{\Phi}$ whose rows are $\boldsymbol{\phi}(\mathbf{x}_i)$.
4. Fit a linear model (e.g., ridge regression) on $(\boldsymbol{\Phi}, \mathbf{y})$.

Steps 1-2 happen once; step 3 is $O(nDd)$; step 4 is $O(nD^2 + D^3)$ for ridge regression.
```python
import numpy as np
from scipy.linalg import solve


class RandomFourierFeatures:
    """
    Random Fourier Features for RBF kernel approximation.

    Transforms data into explicit features such that
    phi(x).T @ phi(x') ≈ exp(-gamma * ||x - x'||^2)

    Parameters:
    -----------
    n_components : int
        Number of random features (D). Higher = better approximation.
    gamma : float
        RBF kernel bandwidth parameter.
    use_offset : bool
        If True, use cos(w.T @ x + b) formulation (D features).
        If False, use [cos(w.T @ x), sin(w.T @ x)] formulation (2D features).
    random_state : int or None
        Random seed for reproducibility.
    """

    def __init__(self, n_components=100, gamma=1.0, use_offset=True, random_state=None):
        self.n_components = n_components
        self.gamma = gamma
        self.use_offset = use_offset
        self.random_state = random_state
        self.W_ = None  # Random frequencies
        self.b_ = None  # Random offsets

    def fit(self, X):
        """
        Sample random frequencies from the spectral distribution.

        Parameters:
        -----------
        X : ndarray of shape (n_samples, n_features)
            Training data (only used to get dimensionality).

        Returns:
        --------
        self
        """
        rng = np.random.RandomState(self.random_state)
        n_features = X.shape[1]

        # For the RBF kernel exp(-gamma * ||x - x'||^2), the spectral
        # distribution is N(0, 2*gamma * I), so the standard deviation
        # of each frequency component is sqrt(2 * gamma).
        self.W_ = rng.randn(n_features, self.n_components) * np.sqrt(2 * self.gamma)

        if self.use_offset:
            self.b_ = rng.uniform(0, 2 * np.pi, size=self.n_components)

        return self

    def transform(self, X):
        """
        Apply the random feature mapping.

        Parameters:
        -----------
        X : ndarray of shape (n_samples, n_features)
            Input data.

        Returns:
        --------
        X_transformed : ndarray
            Transformed features.
        """
        # Compute W^T @ x for all samples: (n_samples, n_components)
        projection = X @ self.W_

        if self.use_offset:
            # cos(W^T @ x + b), D-dimensional output
            features = np.cos(projection + self.b_)
            return features * np.sqrt(2.0 / self.n_components)
        else:
            # [cos(W^T @ x), sin(W^T @ x)], 2D-dimensional output
            cos_features = np.cos(projection)
            sin_features = np.sin(projection)
            features = np.hstack([cos_features, sin_features])
            return features * np.sqrt(1.0 / self.n_components)

    def fit_transform(self, X):
        """Fit and transform in one step."""
        return self.fit(X).transform(X)


class RFFRidgeRegression:
    """
    Ridge regression with Random Fourier Features.

    Approximates kernel ridge regression with the RBF kernel
    at dramatically reduced computational cost.
    """

    def __init__(self, n_components=100, gamma=1.0, alpha=1.0, random_state=None):
        self.n_components = n_components
        self.gamma = gamma
        self.alpha = alpha  # Regularization strength
        self.random_state = random_state
        self.rff_ = None
        self.coef_ = None

    def fit(self, X, y):
        """Fit RFF ridge regression model."""
        # Create and apply RFF transformation
        self.rff_ = RandomFourierFeatures(
            n_components=self.n_components,
            gamma=self.gamma,
            random_state=self.random_state
        )
        Phi = self.rff_.fit_transform(X)

        # Solve ridge regression: (Phi^T Phi + alpha I) w = Phi^T y
        n_features = Phi.shape[1]
        A = Phi.T @ Phi + self.alpha * np.eye(n_features)
        b = Phi.T @ y
        self.coef_ = solve(A, b, assume_a='pos')

        return self

    def predict(self, X):
        """Predict using the fitted model."""
        Phi = self.rff_.transform(X)
        return Phi @ self.coef_


# Example usage comparison
if __name__ == "__main__":
    from sklearn.datasets import make_regression
    from sklearn.kernel_ridge import KernelRidge
    import time

    # Generate synthetic data
    X, y = make_regression(n_samples=5000, n_features=10, noise=0.1)

    # Exact kernel ridge regression
    print("Exact Kernel Ridge Regression:")
    start = time.time()
    krr = KernelRidge(alpha=1.0, kernel='rbf', gamma=0.1)
    krr.fit(X, y)
    exact_time = time.time() - start
    print(f"  Training time: {exact_time:.2f}s")

    # RFF approximation
    print("\nRFF Ridge Regression (D=500):")
    start = time.time()
    rff_model = RFFRidgeRegression(n_components=500, gamma=0.1, alpha=1.0)
    rff_model.fit(X, y)
    rff_time = time.time() - start
    print(f"  Training time: {rff_time:.2f}s")
    print(f"  Speedup: {exact_time/rff_time:.1f}x")
```

How well does the RFF approximation work? This is crucial for understanding when and how to use Random Fourier Features.
Pointwise approximation error:
For a single kernel evaluation, the approximation error is:
$$\hat{k}(\mathbf{x}, \mathbf{x}') - k(\mathbf{x}, \mathbf{x}')$$
Since $\hat{k}$ is a Monte Carlo estimator of $k$, the error is a random variable with:
$$\mathbb{E}[\hat{k}(\mathbf{x}, \mathbf{x}')] = k(\mathbf{x}, \mathbf{x}') \quad \text{(unbiased)}$$
$$\text{Var}[\hat{k}(\mathbf{x}, \mathbf{x}')] = \frac{1}{D} \text{Var}[\phi_{\boldsymbol{\omega}}(\mathbf{x})^T \phi_{\boldsymbol{\omega}}(\mathbf{x}')]$$
The variance decreases as $O(1/D)$—more random features means better approximation.
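The $O(1/\sqrt{D})$ behavior is easy to observe empirically. The sketch below (synthetic data, arbitrary $\gamma$) measures the worst-case entrywise error of the approximate kernel matrix as $D$ grows; quadrupling $D$ should roughly halve the error.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))
gamma = 0.1
K = rbf_kernel(X, gamma=gamma)  # exact exp(-gamma * ||x - x'||^2)

for D in (100, 400, 1600, 6400):
    W = rng.normal(scale=np.sqrt(2 * gamma), size=(X.shape[1], D))
    b = rng.uniform(0, 2 * np.pi, size=D)
    Phi = np.sqrt(2.0 / D) * np.cos(X @ W + b)
    print(D, np.abs(Phi @ Phi.T - K).max())  # max error shrinks roughly like 1/sqrt(D)
```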
Rahimi & Recht (2007) showed that for inputs restricted to a compact set $\mathcal{M} \subset \mathbb{R}^d$, with probability at least $1 - \delta$:
$$\sup_{\mathbf{x}, \mathbf{x}' \in \mathcal{M}} |\hat{k}(\mathbf{x}, \mathbf{x}') - k(\mathbf{x}, \mathbf{x}')| \leq O\left(\sqrt{\frac{\log(1/\delta)}{D}}\right)$$
where the hidden constant depends on the input dimension and the diameter of $\mathcal{M}$. The approximation is uniform over all input pairs in the domain. This is much stronger than pointwise convergence.
Practical implications for $D$ selection:
The error scales as $O(1/\sqrt{D})$, so to halve the approximation error, you need to quadruple $D$. Let's quantify this:
| $D$ | Typical Relative Error | Memory (n=100k, 64-bit) | Good For |
|---|---|---|---|
| 100 | ~10-20% | 80 MB | Rough approximation, prototyping |
| 500 | ~5-10% | 400 MB | General use, moderate accuracy |
| 1,000 | ~3-5% | 800 MB | Standard applications |
| 5,000 | ~1-2% | 4 GB | High-accuracy requirements |
| 10,000 | ~0.5-1% | 8 GB | Near-exact approximation |
Error in the final prediction:
The kernel approximation error propagates to the learned model and predictions. Let $\hat{f}$ be the function learned with RFF and $f^*$ be the exact kernel method solution. Under suitable conditions:
$$\|\hat{f} - f^*\|_{L^2} \leq O\left( \frac{1}{\sqrt{D}} + \sqrt{\frac{D}{n}} \right)$$
This reveals a tradeoff: increasing $D$ shrinks the kernel-approximation term $1/\sqrt{D}$ but inflates the estimation term $\sqrt{D/n}$.
The optimal $D$ balances these, typically $D = O(\sqrt{n})$ or $D = O(n^{1/3})$ depending on the smoothness of the target function.
From kernel approximation to generalization:
The RFF approximation affects generalization through two mechanisms: a bias from replacing the exact kernel (and hence the hypothesis space) with its random approximation, and extra variance from the randomness of the sampled frequencies.
In practice, moderate $D$ (a few thousand) often achieves both good approximation and good generalization, since the regularization in ridge regression mitigates overfitting from using many features.
Start with $D \approx \sqrt{n}$ as a baseline. For $n = 10,000$, this gives $D \approx 100$. Then increase $D$ by factors of 2-4 until cross-validation error stabilizes. For many problems, $D = 1,000-5,000$ is sufficient regardless of $n$.
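A minimal sketch of this doubling strategy, assuming scikit-learn's `RBFSampler` (which implements the random-offset RFF variant); the synthetic dataset, $\gamma$, and ridge penalty below are placeholders:

```python
from sklearn.datasets import make_regression
from sklearn.kernel_approximation import RBFSampler
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

X, y = make_regression(n_samples=10_000, n_features=20, noise=5.0, random_state=0)

for D in (100, 200, 400, 800, 1600):
    model = make_pipeline(
        RBFSampler(gamma=0.05, n_components=D, random_state=0),
        Ridge(alpha=1.0),
    )
    score = cross_val_score(model, X, y, cv=3, scoring="r2").mean()
    print(f"D={D:5d}  CV R^2 = {score:.3f}")  # stop increasing D once this plateaus
```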
The power of Random Fourier Features lies in the dramatic complexity reduction they enable. Let's compare exact kernel methods with RFF-based approximations.
Training complexity:
| Method | Time Complexity | Space Complexity |
|---|---|---|
| Exact Kernel Ridge | $O(n^3)$ | $O(n^2)$ |
| RFF Ridge Regression | $O(nDd + D^3)$ | $O(nD + D^2)$ |
| RFF ($D \ll n$) | $O(nDd)$ | $O(nD)$ |
Concrete comparison (n = 100,000, d = 50, D = 1,000):
| Metric | Exact Kernel | RFF Approximation | Improvement |
|---|---|---|---|
| Training time | $10^{15}$ ops (~28 hours) | $5 \times 10^{9}$ ops (~0.5 seconds) | ~200,000× |
| Memory (matrix) | 80 GB | 800 MB | 100× |
| Prediction time (1 point) | 100,000 kernel evals | 1,000-dim dot product | ~100× |
RFF transforms kernel methods from impractical to trivial at large scales. A problem that would take a day becomes sub-second. A problem that was impossible (million points) becomes routine. This is why RFF revitalized kernel methods for modern datasets.
Prediction complexity:
The improvement at prediction time is equally dramatic: an exact kernel model evaluates the kernel against all $n$ training points ($O(nd)$ per test point), whereas RFF computes $D$ features and one dot product ($O(Dd)$ per test point). For $D \ll n$, RFF predictions are much faster. More importantly, RFF prediction time is independent of training set size—like a parametric model.
The fundamental shift:
RFF converts a non-parametric method (where model complexity grows with data) into a parametric one (fixed $D$ parameters). This is a fundamental change in the computational model:
| Aspect | Exact Kernel | RFF |
|---|---|---|
| Training | $O(n^3)$ | $O(nD^2)$ |
| Prediction | $O(n)$ per point | $O(D)$ per point |
| Storage | $O(n^2)$ kernel + $O(n)$ weights | $O(D \cdot d)$ frequencies + $O(D)$ weights |
| Model grows with | Training data size | Fixed at $D$ |
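To make this concrete, here is a small sketch that reuses the `RFFRidgeRegression` class implemented earlier on this page: models fitted on increasingly large training sets store exactly the same number of parameters.

```python
import numpy as np

# Assumes the RFFRidgeRegression class defined earlier on this page is in scope.
for n in (1_000, 10_000, 100_000):
    X = np.random.randn(n, 20)
    y = np.random.randn(n)
    model = RFFRidgeRegression(n_components=500, gamma=0.1, alpha=1.0,
                               random_state=0).fit(X, y)
    n_params = model.rff_.W_.size + model.rff_.b_.size + model.coef_.size
    print(f"n={n:>7d}  stored parameters: {n_params}")  # identical for every n
```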
Implementing RFF correctly requires attention to several practical details that can significantly impact performance.
1. Kernel parameter matching:
The spectral distribution must match the kernel exactly. For the RBF kernel:
$$k(\mathbf{x}, \mathbf{x}') = \exp\left(-\gamma \|\mathbf{x} - \mathbf{x}'\|^2\right) = \exp\left(-\frac{\|\mathbf{x} - \mathbf{x}'\|^2}{2\sigma^2}\right)$$
where $\gamma = \frac{1}{2\sigma^2}$. The spectral distribution is $p(\boldsymbol{\omega}) = \mathcal{N}(\mathbf{0}, 2\gamma \mathbf{I}) = \mathcal{N}(\mathbf{0}, \frac{1}{\sigma^2}\mathbf{I})$.
Common mistake: Using $\mathcal{N}(\mathbf{0}, \gamma \mathbf{I})$ instead of $\mathcal{N}(\mathbf{0}, 2\gamma \mathbf{I})$ gives the wrong kernel!
Different libraries use different RBF kernel parameterizations. Scikit-learn uses $\exp(-\gamma \|\cdot\|^2)$; GPy uses $\exp(-\|\cdot\|^2 / 2\ell^2)$. When implementing RFF, carefully match your spectral distribution to your kernel's parameterization.
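One way to catch a parameterization mismatch is to compare the approximate Gram matrix against the library's exact kernel. This sketch (scikit-learn's `rbf_kernel` convention, arbitrary data, a throwaway helper) shows the error growing when frequencies are drawn from $\mathcal{N}(\mathbf{0}, \gamma \mathbf{I})$ instead of $\mathcal{N}(\mathbf{0}, 2\gamma \mathbf{I})$:

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 8))
gamma, D = 0.2, 5000
K = rbf_kernel(X, gamma=gamma)  # exp(-gamma * ||x - x'||^2)

def mean_abs_error(freq_std):
    W = rng.normal(scale=freq_std, size=(X.shape[1], D))
    b = rng.uniform(0, 2 * np.pi, size=D)
    Phi = np.sqrt(2.0 / D) * np.cos(X @ W + b)
    return np.abs(Phi @ Phi.T - K).mean()

print("correct N(0, 2*gamma*I):", mean_abs_error(np.sqrt(2 * gamma)))
print("wrong   N(0,   gamma*I):", mean_abs_error(np.sqrt(gamma)))
# the mismatched spectral distribution approximates a different, wider kernel
```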
2. Numerical precision:
RFF features involve trigonometric functions of potentially large arguments:
$$\cos(\boldsymbol{\omega}^T \mathbf{x})$$
If $\|\boldsymbol{\omega}\|$ and $\|\mathbf{x}\|$ are large, the argument grows beyond the range where floating-point cosine and sine remain accurate. Mitigation strategies include standardizing the input features, keeping $\gamma$ at a sensible scale (for example via the median heuristic), and computing the projections in double precision.
3. Memory-efficient implementation:
For very large $n$, even storing $\boldsymbol{\Phi} \in \mathbb{R}^{n \times D}$ may be problematic. One solution is to stream the data in minibatches and accumulate only the $D \times D$ sufficient statistics $\boldsymbol{\Phi}^T\boldsymbol{\Phi}$ and $\boldsymbol{\Phi}^T\mathbf{y}$, as in the implementation below:
```python
import numpy as np
from scipy.linalg import solve


def rff_ridge_minibatch(X, y, n_components, gamma, alpha,
                        batch_size=1000, random_state=None):
    """
    Memory-efficient RFF ridge regression using minibatch aggregation.

    Never stores the full feature matrix; instead accumulates
    Phi^T @ Phi and Phi^T @ y incrementally.

    Parameters:
    -----------
    X : ndarray of shape (n_samples, n_features)
    y : ndarray of shape (n_samples,)
    n_components : int
        Number of random Fourier features.
    gamma : float
        RBF kernel bandwidth.
    alpha : float
        Ridge regularization strength.
    batch_size : int
        Number of samples per batch.
    random_state : int
        Random seed.

    Returns:
    --------
    W : ndarray
        Random frequencies (for prediction)
    b : ndarray
        Random offsets (for prediction)
    coef : ndarray
        Learned coefficients
    """
    n_samples, n_features = X.shape
    rng = np.random.RandomState(random_state)

    # Sample random frequencies and offsets
    W = rng.randn(n_features, n_components) * np.sqrt(2 * gamma)
    b = rng.uniform(0, 2 * np.pi, size=n_components)

    # Initialize accumulators
    PhiT_Phi = np.zeros((n_components, n_components))  # D x D
    PhiT_y = np.zeros(n_components)                    # D

    # Process in batches
    n_batches = (n_samples + batch_size - 1) // batch_size
    for i in range(n_batches):
        start = i * batch_size
        end = min((i + 1) * batch_size, n_samples)
        X_batch = X[start:end]
        y_batch = y[start:end]

        # Compute RFF features for this batch
        projection = X_batch @ W + b  # (batch, D)
        Phi_batch = np.cos(projection) * np.sqrt(2.0 / n_components)

        # Accumulate sufficient statistics
        PhiT_Phi += Phi_batch.T @ Phi_batch
        PhiT_y += Phi_batch.T @ y_batch

    # Solve regularized least squares
    A = PhiT_Phi + alpha * np.eye(n_components)
    coef = solve(A, PhiT_y, assume_a='pos')

    return W, b, coef


def rff_predict(X_test, W, b, coef):
    """Make predictions using the fitted RFF model."""
    n_components = W.shape[1]
    projection = X_test @ W + b
    Phi_test = np.cos(projection) * np.sqrt(2.0 / n_components)
    return Phi_test @ coef
```

The Random Fourier Feature framework has inspired numerous extensions that improve quality, efficiency, or applicability.
1. Orthogonal Random Features (ORF):
Standard RFF uses i.i.d. random frequencies, which can be redundant. Orthogonal Random Features enforce orthogonality among the frequency vectors:
$$\mathbf{W} = \mathbf{S} \cdot \mathbf{Q}$$
where $\mathbf{Q}$ is sampled from the Haar measure on orthogonal matrices (uniformly random orthogonal matrix), and $\mathbf{S}$ is a diagonal matrix with entries from the chi distribution.
Result: Same computational cost, lower variance, better approximation with the same $D$.
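A minimal sketch of the ORF construction for the RBF spectral distribution used above (the helper name is just for illustration): frequency vectors come in blocks of $d$ mutually orthogonal directions, each rescaled to the $\chi_d$-distributed norm a Gaussian sample would have.

```python
import numpy as np

def orthogonal_random_frequencies(d, D, gamma, seed=0):
    """Return W of shape (d, D) for features cos(X @ W + b), RBF kernel exp(-gamma*||.||^2)."""
    rng = np.random.default_rng(seed)
    blocks = []
    for _ in range(int(np.ceil(D / d))):
        G = rng.normal(size=(d, d))
        Q, _ = np.linalg.qr(G)                    # random orthogonal matrix
        s = np.sqrt(rng.chisquare(df=d, size=d))  # chi_d norms, matching Gaussian vectors
        blocks.append(Q * s)                      # scale each orthogonal column by its norm
    W = np.hstack(blocks)[:, :D]
    return W * np.sqrt(2 * gamma)                 # match the N(0, 2*gamma*I) scaling
```

The resulting `W` drops into the earlier code wherever the i.i.d. Gaussian frequencies were used.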
| Variant | Key Idea | Advantage |
|---|---|---|
| Orthogonal RF (ORF) | Use orthogonal frequency matrix | ~3× variance reduction |
| Quadrature Fourier Features | Deterministic frequency grids | No randomness, consistent results |
| Structured ORF (SORF) | Fast transforms (Hadamard, FFT) | O(D log D) instead of O(D²) for transform |
| Generalized Spectral Kernels | Optimize spectral distribution | Better approximation for specific tasks |
| Deep kernel learning | Learn frequencies from data | Data-adaptive feature spaces |
2. Beyond shift-invariant kernels:
RFF relies on Bochner's theorem, which applies only to shift-invariant kernels. For other kernels, analogous random-feature constructions exist: random Maclaurin features and TensorSketch approximate polynomial (dot-product) kernels, while data-dependent methods such as the Nyström approximation (covered next) handle arbitrary positive definite kernels.
3. Random Kitchen Sinks and beyond:
Rahimi and Recht's follow-up work dubbed the approach "Random Kitchen Sinks": throw random nonlinearities at the data and fit a simple linear model on top. This philosophy inspired a broader family of random-feature methods and connects naturally to extreme learning machines and to analyses of neural networks with random weights.
4. Learned features:
Instead of sampling frequencies randomly, we can learn them:
$$\boldsymbol{\phi}(\mathbf{x}; \boldsymbol{\theta}) = \sqrt{\frac{2}{D}} \cos\left(\mathbf{W}(\boldsymbol{\theta})^T \mathbf{x} + \mathbf{b}(\boldsymbol{\theta})\right)$$
Optimizing $\boldsymbol{\theta}$ jointly with the linear weights creates a two-layer neural network with fixed activation patterns. This bridges RFF with deep learning.
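A sketch of what learning the frequencies looks like in practice, assuming PyTorch (the module name is illustrative): the projection and offsets become trainable parameters, and a linear head is trained jointly with them.

```python
import math
import torch
import torch.nn as nn

class LearnedFourierFeatures(nn.Module):
    """Cosine feature layer whose frequencies and offsets are trained by backprop."""

    def __init__(self, d_in, n_components):
        super().__init__()
        self.W = nn.Parameter(torch.randn(d_in, n_components))        # RFF-style init
        self.b = nn.Parameter(2 * math.pi * torch.rand(n_components))
        self.scale = math.sqrt(2.0 / n_components)

    def forward(self, x):
        return self.scale * torch.cos(x @ self.W + self.b)

# A linear head on top yields a two-layer network with cosine activation:
model = nn.Sequential(LearnedFourierFeatures(d_in=10, n_components=256), nn.Linear(256, 1))
```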
RFF can be viewed as a single-layer neural network with cosine activation and random weights. Deep kernel learning generalizes this to multiple layers with learned weights. The boundary between 'kernel methods' and 'neural networks' is blurrier than it appears!
Random Fourier Features represent one of the most important practical advances in kernel methods, making them viable for modern large-scale applications.
What's next:
While RFF provides a powerful general-purpose approximation, alternative approaches offer complementary advantages. Next, we explore the Nyström approximation, which approximates the kernel matrix directly using a subset of training points—a fundamentally different strategy with its own strengths and trade-offs.
You now understand Random Fourier Features from theoretical foundations (Bochner's theorem) through practical implementation details. This technique is a powerful tool in your kernel methods arsenal, enabling you to scale to datasets that would otherwise be completely intractable.