In 2007, Ali Rahimi and Benjamin Recht published a paper that fundamentally changed how we think about kernel methods. Their key insight was elegant and surprising:
Shift-invariant kernels can be approximated by explicit, finite-dimensional random feature mappings.
This means that instead of computing the $n \times n$ kernel matrix—with its quadratic memory and cubic training costs—we can transform our data into a $D$-dimensional feature space and use standard linear methods. The computational complexity drops from $O(n^3)$ to $O(nD^2)$, and with $D \ll n$, kernel methods become feasible for datasets of virtually unlimited size.
This technique, known as Random Fourier Features (RFF), bridges the gap between kernel methods and linear models, combining the expressiveness of the former with the scalability of the latter.
By the end of this page, you will understand Bochner's theorem and its connection to kernels, how to derive the Random Fourier Feature mapping for RBF kernels, the mathematical guarantees on approximation quality, practical implementation considerations, and extensions to other kernel types.
The mathematical foundation of Random Fourier Features is Bochner's theorem, a classical result from harmonic analysis that connects positive definite functions to Fourier transforms.
Shift-invariant kernels:
A kernel $k(\mathbf{x}, \mathbf{x}')$ is shift-invariant (or stationary) if it depends only on the difference $\boldsymbol{\delta} = \mathbf{x} - \mathbf{x}'$:
$$k(\mathbf{x}, \mathbf{x}') = k(\mathbf{x} - \mathbf{x}') = k(\boldsymbol{\delta})$$
Examples of shift-invariant kernels include the RBF (Gaussian), Laplacian, Matérn, and Cauchy kernels, all of which appear in the spectral-distribution table below.
Crucially, the polynomial and linear kernels are not shift-invariant—they depend on absolute positions, not just differences.
A continuous shift-invariant kernel $k(\boldsymbol{\delta})$ is positive definite if and only if it is the Fourier transform of a non-negative measure $p(\boldsymbol{\omega})$:
$$k(\boldsymbol{\delta}) = \int_{\mathbb{R}^d} p(\boldsymbol{\omega}) e^{i \boldsymbol{\omega}^T \boldsymbol{\delta}} d\boldsymbol{\omega}$$
If $k(\mathbf{0}) = 1$ (kernel normalized), then $p(\boldsymbol{\omega})$ is a proper probability distribution.
Understanding the theorem:
Bochner's theorem tells us that every valid shift-invariant kernel has a spectral representation—it can be written as an expectation over random frequencies:
$$k(\mathbf{x} - \mathbf{x}') = \mathbb{E}_{\boldsymbol{\omega} \sim p} \left[ e^{i \boldsymbol{\omega}^T (\mathbf{x} - \mathbf{x}')} \right]$$
Since the kernel is real-valued, we can use the real part:
$$k(\mathbf{x} - \mathbf{x}') = \mathbb{E}_{\boldsymbol{\omega} \sim p} \left[ \cos(\boldsymbol{\omega}^T (\mathbf{x} - \mathbf{x}')) \right]$$
Using the trigonometric identity $\cos(a - b) = \cos(a)\cos(b) + \sin(a)\sin(b)$:
$$k(\mathbf{x} - \mathbf{x}') = \mathbb{E}_{\boldsymbol{\omega} \sim p} \left[ \cos(\boldsymbol{\omega}^T \mathbf{x})\cos(\boldsymbol{\omega}^T \mathbf{x}') + \sin(\boldsymbol{\omega}^T \mathbf{x})\sin(\boldsymbol{\omega}^T \mathbf{x}') \right]$$
This is an inner product! Define the random feature map:
$$\phi_{\boldsymbol{\omega}}(\mathbf{x}) = \begin{bmatrix} \cos(\boldsymbol{\omega}^T \mathbf{x}) \\ \sin(\boldsymbol{\omega}^T \mathbf{x}) \end{bmatrix}$$
Then: $k(\mathbf{x}, \mathbf{x}') = \mathbb{E}_{\boldsymbol{\omega}} [\phi_{\boldsymbol{\omega}}(\mathbf{x})^T \phi_{\boldsymbol{\omega}}(\mathbf{x}')]$
| Kernel | Formula | Spectral Distribution $p(\boldsymbol{\omega})$ |
|---|---|---|
| RBF (Gaussian) | $\exp(-\gamma\lVert\boldsymbol{\delta}\rVert^2)$ | $\mathcal{N}(\mathbf{0}, 2\gamma \mathbf{I})$ — Gaussian with variance $2\gamma$ |
| Laplacian | $\exp(-\gamma\lVert\boldsymbol{\delta}\rVert_1)$ | Product of Cauchy distributions with scale $\gamma$ |
| Matérn ($\nu = 1/2$) | $\exp(-\gamma\lVert\boldsymbol{\delta}\rVert)$ | Student-t with appropriate degrees of freedom |
| Cauchy | $\prod_j \frac{1}{1+\gamma^2 \delta_j^2}$ | Product of Laplace distributions |
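As a quick numerical check on the RBF entry above, the short sketch below (with arbitrary test vectors) compares a Monte Carlo estimate of $\mathbb{E}_{\boldsymbol{\omega}}[\cos(\boldsymbol{\omega}^T \boldsymbol{\delta})]$, with $\boldsymbol{\omega} \sim \mathcal{N}(\mathbf{0}, 2\gamma \mathbf{I})$, against the exact kernel value $\exp(-\gamma \lVert\boldsymbol{\delta}\rVert^2)$:

```python
import numpy as np

rng = np.random.default_rng(0)
d, gamma = 5, 0.5
x, x_prime = rng.normal(size=d), rng.normal(size=d)
delta = x - x_prime

# Frequencies from the RBF spectral distribution N(0, 2*gamma*I)
omega = rng.normal(scale=np.sqrt(2 * gamma), size=(1_000_000, d))

mc_estimate = np.cos(omega @ delta).mean()      # E_omega[cos(omega^T delta)]
exact = np.exp(-gamma * np.sum(delta ** 2))     # exp(-gamma * ||delta||^2)
print(mc_estimate, exact)                       # should agree to ~3 decimal places
```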
The key insight of Random Fourier Features is to approximate the expectation with a Monte Carlo average. Instead of integrating over all possible frequencies, we sample a finite set and average:
$$k(\mathbf{x}, \mathbf{x}') \approx \hat{k}(\mathbf{x}, \mathbf{x}') = \frac{1}{D} \sum_{j=1}^{D} \phi_{\boldsymbol{\omega}_j}(\mathbf{x})^T \phi_{\boldsymbol{\omega}_j}(\mathbf{x}')$$
where $\boldsymbol{\omega}_1, \ldots, \boldsymbol{\omega}_D \overset{\text{i.i.d.}}{\sim} p(\boldsymbol{\omega})$.
The explicit feature map:
We can rewrite this as an inner product of finite-dimensional features:
$$\hat{k}(\mathbf{x}, \mathbf{x}') = \boldsymbol{\phi}(\mathbf{x})^T \boldsymbol{\phi}(\mathbf{x}')$$
where the Random Fourier Feature map $\boldsymbol{\phi}: \mathbb{R}^d \to \mathbb{R}^{2D}$ is:
$$\boldsymbol{\phi}(\mathbf{x}) = \sqrt{\frac{1}{D}} \begin{bmatrix} \cos(\boldsymbol{\omega}_1^T \mathbf{x}) \\ \sin(\boldsymbol{\omega}_1^T \mathbf{x}) \\ \vdots \\ \cos(\boldsymbol{\omega}_D^T \mathbf{x}) \\ \sin(\boldsymbol{\omega}_D^T \mathbf{x}) \end{bmatrix}$$
Random Fourier Features transform a kernel machine into a linear model! After computing $\boldsymbol{\phi}(\mathbf{x})$ for each training point, we have an explicit feature matrix $\boldsymbol{\Phi} \in \mathbb{R}^{n \times 2D}$, and we can use standard linear regression or classification methods.
Alternative formulation with random offsets:
A slightly different but equivalent formulation uses random phase offsets instead of paired sine/cosine:
$$\phi_j(\mathbf{x}) = \sqrt{\frac{2}{D}} \cos(\boldsymbol{\omega}_j^T \mathbf{x} + b_j)$$
where $b_j \sim \text{Uniform}[0, 2\pi]$. This gives $D$-dimensional features (rather than $2D$) with comparable approximation quality.
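To see why the offset formulation reproduces the same kernel, note that averaging over the uniform phase gives $\mathbb{E}_b[2\cos(a + b)\cos(a' + b)] = \cos(a - a')$. A tiny numerical check, with arbitrary values standing in for $\boldsymbol{\omega}^T\mathbf{x}$ and $\boldsymbol{\omega}^T\mathbf{x}'$:

```python
import numpy as np

rng = np.random.default_rng(0)
a, a_prime = 1.3, -0.4                     # stand-ins for w^T x and w^T x'
b = rng.uniform(0, 2 * np.pi, size=2_000_000)

lhs = (2 * np.cos(a + b) * np.cos(a_prime + b)).mean()  # E_b[2 cos(a+b) cos(a'+b)]
rhs = np.cos(a - a_prime)
print(lhs, rhs)  # both ≈ cos(a - a'), the same quantity the paired cos/sin map yields
```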
The complete algorithm:
1. Sample frequencies $\boldsymbol{\omega}_1, \ldots, \boldsymbol{\omega}_D \overset{\text{i.i.d.}}{\sim} p(\boldsymbol{\omega})$.
2. Sample offsets $b_1, \ldots, b_D \sim \text{Uniform}[0, 2\pi]$ (if using the offset formulation).
3. Compute the feature matrix $\boldsymbol{\Phi}$ whose rows are $\boldsymbol{\phi}(\mathbf{x}_i)$.
4. Fit a linear model (e.g., ridge regression) on $(\boldsymbol{\Phi}, \mathbf{y})$.

Steps 1-2 happen once; step 3 is $O(nDd)$; step 4 is $O(nD^2 + D^3)$ for ridge regression.
```python
import numpy as np
from scipy.linalg import solve


class RandomFourierFeatures:
    """
    Random Fourier Features for RBF kernel approximation.

    Transforms data into explicit features such that
    phi(x).T @ phi(x') ≈ exp(-gamma * ||x - x'||^2)

    Parameters:
    -----------
    n_components : int
        Number of random features (D). Higher = better approximation.
    gamma : float
        RBF kernel bandwidth parameter.
    use_offset : bool
        If True, use cos(w.T @ x + b) formulation (D features).
        If False, use [cos(w.T @ x), sin(w.T @ x)] formulation (2D features).
    random_state : int or None
        Random seed for reproducibility.
    """

    def __init__(self, n_components=100, gamma=1.0, use_offset=True, random_state=None):
        self.n_components = n_components
        self.gamma = gamma
        self.use_offset = use_offset
        self.random_state = random_state
        self.W_ = None  # Random frequencies
        self.b_ = None  # Random offsets

    def fit(self, X):
        """
        Sample random frequencies from the spectral distribution.

        Parameters:
        -----------
        X : ndarray of shape (n_samples, n_features)
            Training data (only used to get dimensionality).

        Returns:
        --------
        self
        """
        rng = np.random.RandomState(self.random_state)
        n_features = X.shape[1]

        # For the RBF kernel exp(-gamma * ||x - x'||^2), the spectral
        # distribution is N(0, 2*gamma * I), so the standard deviation
        # of each frequency component is sqrt(2 * gamma).
        self.W_ = rng.randn(n_features, self.n_components) * np.sqrt(2 * self.gamma)

        if self.use_offset:
            self.b_ = rng.uniform(0, 2 * np.pi, size=self.n_components)

        return self

    def transform(self, X):
        """
        Apply the random feature mapping.

        Parameters:
        -----------
        X : ndarray of shape (n_samples, n_features)
            Input data.

        Returns:
        --------
        X_transformed : ndarray
            Transformed features.
        """
        # Compute W^T @ x for all samples: (n_samples, n_components)
        projection = X @ self.W_

        if self.use_offset:
            # cos(W^T @ x + b), D-dimensional output
            features = np.cos(projection + self.b_)
            return features * np.sqrt(2.0 / self.n_components)
        else:
            # [cos(W^T @ x), sin(W^T @ x)], 2D-dimensional output
            cos_features = np.cos(projection)
            sin_features = np.sin(projection)
            features = np.hstack([cos_features, sin_features])
            return features * np.sqrt(1.0 / self.n_components)

    def fit_transform(self, X):
        """Fit and transform in one step."""
        return self.fit(X).transform(X)


class RFFRidgeRegression:
    """
    Ridge regression with Random Fourier Features.

    Approximates kernel ridge regression with the RBF kernel
    at dramatically reduced computational cost.
    """

    def __init__(self, n_components=100, gamma=1.0, alpha=1.0, random_state=None):
        self.n_components = n_components
        self.gamma = gamma
        self.alpha = alpha  # Regularization strength
        self.random_state = random_state
        self.rff_ = None
        self.coef_ = None

    def fit(self, X, y):
        """Fit RFF ridge regression model."""
        # Create and apply RFF transformation
        self.rff_ = RandomFourierFeatures(
            n_components=self.n_components,
            gamma=self.gamma,
            random_state=self.random_state
        )
        Phi = self.rff_.fit_transform(X)

        # Solve ridge regression: (Phi^T Phi + alpha I) w = Phi^T y
        n_features = Phi.shape[1]
        A = Phi.T @ Phi + self.alpha * np.eye(n_features)
        b = Phi.T @ y
        self.coef_ = solve(A, b, assume_a='pos')

        return self

    def predict(self, X):
        """Predict using the fitted model."""
        Phi = self.rff_.transform(X)
        return Phi @ self.coef_


# Example usage comparison
if __name__ == "__main__":
    from sklearn.datasets import make_regression
    from sklearn.kernel_ridge import KernelRidge
    import time

    # Generate synthetic data
    X, y = make_regression(n_samples=5000, n_features=10, noise=0.1)

    # Exact kernel ridge regression
    print("Exact Kernel Ridge Regression:")
    start = time.time()
    krr = KernelRidge(alpha=1.0, kernel='rbf', gamma=0.1)
    krr.fit(X, y)
    exact_time = time.time() - start
    print(f"  Training time: {exact_time:.2f}s")

    # RFF approximation
    print("\nRFF Ridge Regression (D=500):")
    start = time.time()
    rff_model = RFFRidgeRegression(n_components=500, gamma=0.1, alpha=1.0)
    rff_model.fit(X, y)
    rff_time = time.time() - start
    print(f"  Training time: {rff_time:.2f}s")
    print(f"  Speedup: {exact_time/rff_time:.1f}x")
```

How well does the RFF approximation work? This is crucial for understanding when and how to use Random Fourier Features.
Pointwise approximation error:
For a single kernel evaluation, the approximation error is:
$$\hat{k}(\mathbf{x}, \mathbf{x}') - k(\mathbf{x}, \mathbf{x}')$$
Since $\hat{k}$ is a Monte Carlo estimator of $k$, the error is a random variable with:
$$\mathbb{E}[\hat{k}(\mathbf{x}, \mathbf{x}')] = k(\mathbf{x}, \mathbf{x}') \quad \text{(unbiased)}$$
$$\text{Var}[\hat{k}(\mathbf{x}, \mathbf{x}')] = \frac{1}{D} \text{Var}[\phi_{\boldsymbol{\omega}}(\mathbf{x})^T \phi_{\boldsymbol{\omega}}(\mathbf{x}')]$$
The variance decreases as $O(1/D)$—more random features means better approximation.
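The $O(1/\sqrt{D})$ behavior is easy to observe empirically. The sketch below (synthetic data, arbitrary $\gamma$) measures the worst-case entrywise error of the approximate kernel matrix as $D$ grows; quadrupling $D$ should roughly halve the error.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))
gamma = 0.1
K = rbf_kernel(X, gamma=gamma)  # exact exp(-gamma * ||x - x'||^2)

for D in (100, 400, 1600, 6400):
    W = rng.normal(scale=np.sqrt(2 * gamma), size=(X.shape[1], D))
    b = rng.uniform(0, 2 * np.pi, size=D)
    Phi = np.sqrt(2.0 / D) * np.cos(X @ W + b)
    print(D, np.abs(Phi @ Phi.T - K).max())  # max error shrinks roughly like 1/sqrt(D)
```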
Rahimi & Recht (2007) showed that for inputs restricted to a compact set $\mathcal{M} \subset \mathbb{R}^d$, with probability at least $1 - \delta$:
$$\sup_{\mathbf{x}, \mathbf{x}' \in \mathcal{M}} |\hat{k}(\mathbf{x}, \mathbf{x}') - k(\mathbf{x}, \mathbf{x}')| \leq O\left(\sqrt{\frac{\log(1/\delta)}{D}}\right)$$
where the hidden constant depends on the input dimension and the diameter of $\mathcal{M}$. The approximation is uniform over all input pairs in the domain. This is much stronger than pointwise convergence.
Practical implications for $D$ selection:
The error scales as $O(1/\sqrt{D})$, so to halve the approximation error, you need to quadruple $D$. Let's quantify this:
| $D$ | Typical Relative Error | Memory (n=100k, 64-bit) | Good For |
|---|---|---|---|
| 100 | ~10-20% | 80 MB | Rough approximation, prototyping |
| 500 | ~5-10% | 400 MB | General use, moderate accuracy |
| 1,000 | ~3-5% | 800 MB | Standard applications |
| 5,000 | ~1-2% | 4 GB | High-accuracy requirements |
| 10,000 | ~0.5-1% | 8 GB | Near-exact approximation |
Error in the final prediction:
The kernel approximation error propagates to the learned model and predictions. Let $\hat{f}$ be the function learned with RFF and $f^*$ be the exact kernel method solution. Under suitable conditions:
$$\|\hat{f} - f^*\|_{L^2} \leq O\left( \frac{1}{\sqrt{D}} + \sqrt{\frac{D}{n}} \right)$$
This reveals a tradeoff: increasing $D$ shrinks the kernel-approximation term $1/\sqrt{D}$ but inflates the estimation term $\sqrt{D/n}$.
The optimal $D$ balances these, typically $D = O(\sqrt{n})$ or $D = O(n^{1/3})$ depending on the smoothness of the target function.
From kernel approximation to generalization:
The RFF approximation affects generalization through two mechanisms: a bias from replacing the exact kernel (and hence the hypothesis space) with its random approximation, and extra variance from the randomness of the sampled frequencies.
In practice, moderate $D$ (a few thousand) often achieves both good approximation and good generalization, since the regularization in ridge regression mitigates overfitting from using many features.
Start with $D \approx \sqrt{n}$ as a baseline. For $n = 10,000$, this gives $D \approx 100$. Then increase $D$ by factors of 2-4 until cross-validation error stabilizes. For many problems, $D = 1,000-5,000$ is sufficient regardless of $n$.
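A minimal sketch of this doubling strategy, assuming scikit-learn's `RBFSampler` (which implements the random-offset RFF variant); the synthetic dataset, $\gamma$, and ridge penalty below are placeholders:

```python
from sklearn.datasets import make_regression
from sklearn.kernel_approximation import RBFSampler
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

X, y = make_regression(n_samples=10_000, n_features=20, noise=5.0, random_state=0)

for D in (100, 200, 400, 800, 1600):
    model = make_pipeline(
        RBFSampler(gamma=0.05, n_components=D, random_state=0),
        Ridge(alpha=1.0),
    )
    score = cross_val_score(model, X, y, cv=3, scoring="r2").mean()
    print(f"D={D:5d}  CV R^2 = {score:.3f}")  # stop increasing D once this plateaus
```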
The power of Random Fourier Features lies in the dramatic complexity reduction they enable. Let's compare exact kernel methods with RFF-based approximations.
Training complexity:
| Method | Time Complexity | Space Complexity |
|---|---|---|
| Exact Kernel Ridge | $O(n^3)$ | $O(n^2)$ |
| RFF Ridge Regression | $O(nDd + D^3)$ | $O(nD + D^2)$ |
| RFF ($D \ll n$) | $O(nDd)$ | $O(nD)$ |
Concrete comparison (n = 100,000, d = 50, D = 1,000):
| Metric | Exact Kernel | RFF Approximation | Improvement |
|---|---|---|---|
| Training time | $10^{15}$ ops (~28 hours) | $5 \times 10^{9}$ ops (~0.5 seconds) | ~200,000× |
| Memory (matrix) | 80 GB | 800 MB | 100× |
| Prediction time (1 point) | 100,000 kernel evals | 1,000-dim dot product | ~100× |
RFF transforms kernel methods from impractical to trivial at large scales. A problem that would take a day becomes sub-second. A problem that was impossible (million points) becomes routine. This is why RFF revitalized kernel methods for modern datasets.
Prediction complexity:
The improvement at prediction time is equally dramatic: an exact kernel model evaluates the kernel against all $n$ training points ($O(nd)$ per test point), whereas RFF computes $D$ features and one dot product ($O(Dd)$ per test point). For $D \ll n$, RFF predictions are much faster. More importantly, RFF prediction time is independent of training set size—like a parametric model.
The fundamental shift:
RFF converts a non-parametric method (where model complexity grows with data) into a parametric one (fixed $D$ parameters). This is a fundamental change in the computational model:
| Aspect | Exact Kernel | RFF |
|---|---|---|
| Training | $O(n^3)$ | $O(nD^2)$ |
| Prediction | $O(n)$ per point | $O(D)$ per point |
| Storage | $O(n^2)$ kernel + $O(n)$ weights | $O(D \cdot d)$ frequencies + $O(D)$ weights |
| Model grows with | Training data size | Fixed at $D$ |
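To make this concrete, here is a small sketch that reuses the `RFFRidgeRegression` class implemented earlier on this page: models fitted on increasingly large training sets store exactly the same number of parameters.

```python
import numpy as np

# Assumes the RFFRidgeRegression class defined earlier on this page is in scope.
for n in (1_000, 10_000, 100_000):
    X = np.random.randn(n, 20)
    y = np.random.randn(n)
    model = RFFRidgeRegression(n_components=500, gamma=0.1, alpha=1.0,
                               random_state=0).fit(X, y)
    n_params = model.rff_.W_.size + model.rff_.b_.size + model.coef_.size
    print(f"n={n:>7d}  stored parameters: {n_params}")  # identical for every n
```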
Implementing RFF correctly requires attention to several practical details that can significantly impact performance.
1. Kernel parameter matching:
The spectral distribution must match the kernel exactly. For the RBF kernel:
$$k(\mathbf{x}, \mathbf{x}') = \exp\left(-\gamma \|\mathbf{x} - \mathbf{x}'\|^2\right) = \exp\left(-\frac{\|\mathbf{x} - \mathbf{x}'\|^2}{2\sigma^2}\right)$$
where $\gamma = \frac{1}{2\sigma^2}$. The spectral distribution is $p(\boldsymbol{\omega}) = \mathcal{N}(\mathbf{0}, 2\gamma \mathbf{I}) = \mathcal{N}(\mathbf{0}, \frac{1}{\sigma^2}\mathbf{I})$.
Common mistake: Using $\mathcal{N}(\mathbf{0}, \gamma \mathbf{I})$ instead of $\mathcal{N}(\mathbf{0}, 2\gamma \mathbf{I})$ gives the wrong kernel!
Different libraries use different RBF kernel parameterizations. Scikit-learn uses $\exp(-\gamma \|\cdot\|^2)$; GPy uses $\exp(-\|\cdot\|^2 / 2\ell^2)$. When implementing RFF, carefully match your spectral distribution to your kernel's parameterization.
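One way to catch a parameterization mismatch is to compare the approximate Gram matrix against the library's exact kernel. This sketch (scikit-learn's `rbf_kernel` convention, arbitrary data, a throwaway helper) shows the error growing when frequencies are drawn from $\mathcal{N}(\mathbf{0}, \gamma \mathbf{I})$ instead of $\mathcal{N}(\mathbf{0}, 2\gamma \mathbf{I})$:

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 8))
gamma, D = 0.2, 5000
K = rbf_kernel(X, gamma=gamma)  # exp(-gamma * ||x - x'||^2)

def mean_abs_error(freq_std):
    W = rng.normal(scale=freq_std, size=(X.shape[1], D))
    b = rng.uniform(0, 2 * np.pi, size=D)
    Phi = np.sqrt(2.0 / D) * np.cos(X @ W + b)
    return np.abs(Phi @ Phi.T - K).mean()

print("correct N(0, 2*gamma*I):", mean_abs_error(np.sqrt(2 * gamma)))
print("wrong   N(0,   gamma*I):", mean_abs_error(np.sqrt(gamma)))
# the mismatched spectral distribution approximates a different, wider kernel
```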
2. Numerical precision:
RFF features involve trigonometric functions of potentially large arguments:
$$\cos(\boldsymbol{\omega}^T \mathbf{x})$$
If $\|\boldsymbol{\omega}\|$ and $\|\mathbf{x}\|$ are large, the argument grows beyond the range where floating-point cosine and sine remain accurate. Mitigation strategies include standardizing the input features, keeping $\gamma$ at a sensible scale (for example via the median heuristic), and computing the projections in double precision.
3. Memory-efficient implementation:
For very large $n$, even storing $\boldsymbol{\Phi} \in \mathbb{R}^{n \times D}$ may be problematic. One solution is to stream the data in minibatches and accumulate only the $D \times D$ sufficient statistics $\boldsymbol{\Phi}^T\boldsymbol{\Phi}$ and $\boldsymbol{\Phi}^T\mathbf{y}$, as in the implementation below:
```python
import numpy as np
from scipy.linalg import solve


def rff_ridge_minibatch(X, y, n_components, gamma, alpha,
                        batch_size=1000, random_state=None):
    """
    Memory-efficient RFF ridge regression using minibatch aggregation.

    Never stores the full feature matrix; instead accumulates
    Phi^T @ Phi and Phi^T @ y incrementally.

    Parameters:
    -----------
    X : ndarray of shape (n_samples, n_features)
    y : ndarray of shape (n_samples,)
    n_components : int
        Number of random Fourier features.
    gamma : float
        RBF kernel bandwidth.
    alpha : float
        Ridge regularization strength.
    batch_size : int
        Number of samples per batch.
    random_state : int
        Random seed.

    Returns:
    --------
    W : ndarray
        Random frequencies (for prediction)
    b : ndarray
        Random offsets (for prediction)
    coef : ndarray
        Learned coefficients
    """
    n_samples, n_features = X.shape
    rng = np.random.RandomState(random_state)

    # Sample random frequencies and offsets
    W = rng.randn(n_features, n_components) * np.sqrt(2 * gamma)
    b = rng.uniform(0, 2 * np.pi, size=n_components)

    # Initialize accumulators
    PhiT_Phi = np.zeros((n_components, n_components))  # D x D
    PhiT_y = np.zeros(n_components)                    # D

    # Process in batches
    n_batches = (n_samples + batch_size - 1) // batch_size
    for i in range(n_batches):
        start = i * batch_size
        end = min((i + 1) * batch_size, n_samples)
        X_batch = X[start:end]
        y_batch = y[start:end]

        # Compute RFF features for this batch
        projection = X_batch @ W + b  # (batch, D)
        Phi_batch = np.cos(projection) * np.sqrt(2.0 / n_components)

        # Accumulate sufficient statistics
        PhiT_Phi += Phi_batch.T @ Phi_batch
        PhiT_y += Phi_batch.T @ y_batch

    # Solve regularized least squares
    A = PhiT_Phi + alpha * np.eye(n_components)
    coef = solve(A, PhiT_y, assume_a='pos')

    return W, b, coef


def rff_predict(X_test, W, b, coef):
    """Make predictions using the fitted RFF model."""
    n_components = W.shape[1]
    projection = X_test @ W + b
    Phi_test = np.cos(projection) * np.sqrt(2.0 / n_components)
    return Phi_test @ coef
```

The Random Fourier Feature framework has inspired numerous extensions that improve quality, efficiency, or applicability.
1. Orthogonal Random Features (ORF):
Standard RFF uses i.i.d. random frequencies, which can be redundant. Orthogonal Random Features enforce orthogonality among the frequency vectors:
$$\mathbf{W} = \mathbf{S} \cdot \mathbf{Q}$$
where $\mathbf{Q}$ is sampled from the Haar measure on orthogonal matrices (uniformly random orthogonal matrix), and $\mathbf{S}$ is a diagonal matrix with entries from the chi distribution.
Result: Same computational cost, lower variance, better approximation with the same $D$.
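A minimal sketch of the ORF construction for the RBF spectral distribution used above (the helper name is just for illustration): frequency vectors come in blocks of $d$ mutually orthogonal directions, each rescaled to the $\chi_d$-distributed norm a Gaussian sample would have.

```python
import numpy as np

def orthogonal_random_frequencies(d, D, gamma, seed=0):
    """Return W of shape (d, D) for features cos(X @ W + b), RBF kernel exp(-gamma*||.||^2)."""
    rng = np.random.default_rng(seed)
    blocks = []
    for _ in range(int(np.ceil(D / d))):
        G = rng.normal(size=(d, d))
        Q, _ = np.linalg.qr(G)                    # random orthogonal matrix
        s = np.sqrt(rng.chisquare(df=d, size=d))  # chi_d norms, matching Gaussian vectors
        blocks.append(Q * s)                      # scale each orthogonal column by its norm
    W = np.hstack(blocks)[:, :D]
    return W * np.sqrt(2 * gamma)                 # match the N(0, 2*gamma*I) scaling
```

The resulting `W` drops into the earlier code wherever the i.i.d. Gaussian frequencies were used.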
| Variant | Key Idea | Advantage |
|---|---|---|
| Orthogonal RF (ORF) | Use orthogonal frequency matrix | ~3× variance reduction |
| Quadrature Fourier Features | Deterministic frequency grids | No randomness, consistent results |
| Structured ORF (SORF) | Fast transforms (Hadamard, FFT) | O(D log D) instead of O(D²) for transform |
| Generalized Spectral Kernels | Optimize spectral distribution | Better approximation for specific tasks |
| Deep kernel learning | Learn frequencies from data | Data-adaptive feature spaces |
2. Beyond shift-invariant kernels:
RFF relies on Bochner's theorem, which applies only to shift-invariant kernels. For other kernels, analogous random-feature constructions exist: random Maclaurin features and TensorSketch approximate polynomial (dot-product) kernels, while data-dependent methods such as the Nyström approximation (covered next) handle arbitrary positive definite kernels.
3. Random Kitchen Sinks and beyond:
Rahimi and Recht's follow-up work dubbed the approach "Random Kitchen Sinks": throw random nonlinearities at the data and fit a simple linear model on top. This philosophy inspired a broader family of random-feature methods and connects naturally to extreme learning machines and to analyses of neural networks with random weights.
4. Learned features:
Instead of sampling frequencies randomly, we can learn them:
$$\boldsymbol{\phi}(\mathbf{x}; \boldsymbol{\theta}) = \sqrt{\frac{2}{D}} \cos\left(\mathbf{W}(\boldsymbol{\theta})^T \mathbf{x} + \mathbf{b}(\boldsymbol{\theta})\right)$$
Optimizing $\boldsymbol{\theta}$ jointly with the linear weights creates a two-layer neural network with fixed activation patterns. This bridges RFF with deep learning.
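A sketch of what learning the frequencies looks like in practice, assuming PyTorch (the module name is illustrative): the projection and offsets become trainable parameters, and a linear head is trained jointly with them.

```python
import math
import torch
import torch.nn as nn

class LearnedFourierFeatures(nn.Module):
    """Cosine feature layer whose frequencies and offsets are trained by backprop."""

    def __init__(self, d_in, n_components):
        super().__init__()
        self.W = nn.Parameter(torch.randn(d_in, n_components))        # RFF-style init
        self.b = nn.Parameter(2 * math.pi * torch.rand(n_components))
        self.scale = math.sqrt(2.0 / n_components)

    def forward(self, x):
        return self.scale * torch.cos(x @ self.W + self.b)

# A linear head on top yields a two-layer network with cosine activation:
model = nn.Sequential(LearnedFourierFeatures(d_in=10, n_components=256), nn.Linear(256, 1))
```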
RFF can be viewed as a single-layer neural network with cosine activation and random weights. Deep kernel learning generalizes this to multiple layers with learned weights. The boundary between 'kernel methods' and 'neural networks' is blurrier than it appears!
Random Fourier Features represent one of the most important practical advances in kernel methods, making them viable for modern large-scale applications.
What's next:
While RFF provides a powerful general-purpose approximation, alternative approaches offer complementary advantages. Next, we explore the Nyström approximation, which approximates the kernel matrix directly using a subset of training points—a fundamentally different strategy with its own strengths and trade-offs.
You now understand Random Fourier Features from theoretical foundations (Bochner's theorem) through practical implementation details. This technique is a powerful tool in your kernel methods arsenal, enabling you to scale to datasets that would otherwise be completely intractable.