The scalable GP methods we have explored so far—sparse approximations and variational inference—reduce complexity by compressing the GP through a small set of inducing points. Random feature methods take a fundamentally different approach: rather than approximating the GP structure, they approximate the kernel function itself.
The core insight, formalized in Rahimi and Recht's seminal 2007 paper, is that many kernel functions can be written as expectations over explicit feature maps:
$$k(\mathbf{x}, \mathbf{x}') = \mathbb{E}_{\omega \sim p(\omega)}[\phi_\omega(\mathbf{x})^\top \phi_\omega(\mathbf{x}')]$$
By sampling D random "frequencies" $\{\omega_1, ..., \omega_D\}$ from the distribution $p(\omega)$, we obtain a Monte Carlo approximation:
$$k(\mathbf{x}, \mathbf{x}') \approx \hat{\phi}(\mathbf{x})^\top \hat{\phi}(\mathbf{x}')$$
where $\hat{\phi}(\mathbf{x}) \in \mathbb{R}^D$ is an explicit, low-dimensional feature vector.
This transformation is revolutionary: it converts kernel methods, whose complexity scales with data size, into standard linear methods operating on D-dimensional features. GP inference becomes linear regression in feature space, reducing O(N³) to O(ND² + D³)—a remarkable improvement when D << N.
By the end of this page, you will understand how Bochner's theorem provides the foundation for random features, master Random Fourier Features (RFF) for stationary kernels, explore advanced variants such as Orthogonal Random Features and Quadrature Fourier Features, and appreciate both the power and the limitations of this approach to GP scalability.
The mathematical foundation for random features comes from Bochner's theorem, a foundational result in harmonic analysis that characterizes positive-definite functions.
Bochner's Theorem (1933):
A continuous, stationary kernel $k(\mathbf{x} - \mathbf{x}') = k(\boldsymbol{\tau})$ is positive-definite if and only if it is the Fourier transform of a non-negative measure $p(\boldsymbol{\omega})$:
$$k(\boldsymbol{\tau}) = \int_{\mathbb{R}^d} e^{i \boldsymbol{\omega}^\top \boldsymbol{\tau}} p(\boldsymbol{\omega}) d\boldsymbol{\omega}$$
When $k(\mathbf{0}) = 1$ (normalized kernel), $p(\boldsymbol{\omega})$ is a proper probability distribution called the spectral density of the kernel.
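As a concrete example, the SE kernel and the Gaussian density form a Fourier pair, which identifies its spectral density directly:

$$k_{\text{SE}}(\boldsymbol{\tau}) = \exp\left(-\frac{\|\boldsymbol{\tau}\|^2}{2\ell^2}\right) = \int_{\mathbb{R}^d} e^{i\boldsymbol{\omega}^\top \boldsymbol{\tau}} \left(\frac{\ell^2}{2\pi}\right)^{d/2} \exp\left(-\frac{\ell^2\|\boldsymbol{\omega}\|^2}{2}\right) d\boldsymbol{\omega}$$

so sampling frequencies for the SE kernel simply means drawing $\boldsymbol{\omega} \sim \mathcal{N}(\mathbf{0}, \ell^{-2}I)$.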
Key insight:
The kernel can be written as an expectation:
$$k(\mathbf{x} - \mathbf{x}') = \mathbb{E}_{\boldsymbol{\omega} \sim p}\left[e^{i \boldsymbol{\omega}^\top (\mathbf{x} - \mathbf{x}')}\right] = \mathbb{E}_{\boldsymbol{\omega} \sim p}\left[e^{i \boldsymbol{\omega}^\top \mathbf{x}}\overline{e^{i \boldsymbol{\omega}^\top \mathbf{x}'}}\right]$$
where $\overline{\cdot}$ denotes complex conjugate.
For real-valued kernels, we can use the real part of the exponential:
$$k(\mathbf{x} - \mathbf{x}') = \mathbb{E}_{\boldsymbol{\omega}, b}\left[\sqrt{2}\cos(\boldsymbol{\omega}^\top \mathbf{x} + b) \cdot \sqrt{2}\cos(\boldsymbol{\omega}^\top \mathbf{x}' + b)\right]$$
where $b \sim \text{Uniform}[0, 2\pi)$.
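This expectation is easy to verify numerically. The following quick check (a sketch for the 1-D SE kernel with unit length-scale) Monte Carlo-estimates the cosine expectation and compares it against the closed-form kernel:

```python
import numpy as np

# Empirical check of the cosine identity behind Bochner's theorem: for the
# SE kernel in d=1 with lengthscale 1, E[2 cos(ωx+b) cos(ωx'+b)] should
# recover k(x - x') = exp(-(x - x')² / 2).
rng = np.random.default_rng(0)
n_samples = 200_000
x, x_prime = 0.3, 1.1

omega = rng.standard_normal(n_samples)    # ω ~ N(0, 1), the SE spectral density
b = rng.uniform(0, 2 * np.pi, n_samples)  # b ~ Uniform[0, 2π)

estimate = np.mean(2 * np.cos(omega * x + b) * np.cos(omega * x_prime + b))
exact = np.exp(-0.5 * (x - x_prime) ** 2)
print(estimate, exact)  # agree to roughly two decimal places
```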
| Kernel | k(τ) | Spectral Density p(ω) |
|---|---|---|
| Squared Exponential (RBF) | exp(-||τ||²/2ℓ²) | N(0, ℓ⁻²I) — Gaussian |
| Matérn (ν=1/2, Exp) | exp(-||τ||/ℓ) | Multivariate Cauchy: ∝ (1 + ℓ²||ω||²)^(-(d+1)/2) |
| Matérn (ν=3/2) | (1 + √3||τ||/ℓ)exp(-√3||τ||/ℓ) | Student-t-like |
| Matérn (ν→∞) | exp(-||τ||²/2ℓ²) | Gaussian (same as SE) |
| Periodic | exp(-2sin²(π||τ||/p)/ℓ²) | Bessel functions |
Bochner's theorem applies only to stationary (translation-invariant) kernels where k(x, x') = k(x - x'). Non-stationary kernels like the linear kernel or dot-product kernels require different techniques. However, many practical applications use stationary kernels, making random features broadly applicable.
Random Fourier Features (RFF) directly implement the Monte Carlo approximation to Bochner's expectation:
Algorithm:
Random Fourier Features Algorithm
Input: Kernel k with spectral density p(ω), number of features D
Output: Feature map φ: R^d → R^D
1. Sample {ω_1, ..., ω_{D/2}} ~ p(ω)
2. Sample {b_1, ..., b_{D/2}} ~ Uniform[0, 2π)
3. Define feature map:
φ(x) = sqrt(2/D) * [cos(ω_1^T x + b_1), sin(ω_1^T x + b_1),
cos(ω_2^T x + b_2), sin(ω_2^T x + b_2),
...,
cos(ω_{D/2}^T x + b_{D/2}), sin(ω_{D/2}^T x + b_{D/2})]
4. Return φ
The kernel is then approximated as: $$\hat{k}(\mathbf{x}, \mathbf{x}') = \phi(\mathbf{x})^\top \phi(\mathbf{x}')$$
For the Squared Exponential kernel with length-scale $\ell$:
```python
import numpy as np


class RandomFourierFeatures:
    """
    Random Fourier Features for kernel approximation.

    Implements the Rahimi & Recht (2007) algorithm for approximating
    stationary kernels via random sampling from the spectral density.
    """

    def __init__(self, d: int, D: int, kernel_type: str = 'rbf',
                 lengthscale: float = 1.0, variance: float = 1.0,
                 seed: int = None):
        """
        Args:
            d: Input dimensionality
            D: Number of random features (must be even for sin+cos)
            kernel_type: 'rbf' (squared exponential) or 'matern12' (exponential)
            lengthscale: Kernel length-scale parameter
            variance: Kernel variance (output scale)
            seed: Random seed for reproducibility
        """
        self.d = d
        self.D = D
        self.kernel_type = kernel_type
        self.lengthscale = lengthscale
        self.variance = variance

        rng = np.random.RandomState(seed)

        # Sample frequencies from the spectral density
        if kernel_type == 'rbf':
            # SE kernel: p(ω) = N(0, ℓ^{-2}I)
            self.omega = rng.randn(D // 2, d) / lengthscale
        elif kernel_type == 'matern12':
            # Exponential kernel: p(ω) ∝ (1 + ℓ²||ω||²)^{-(d+1)/2}
            # This is a multivariate Cauchy (Student-t with ν=1).
            # Sample via: ω = z / sqrt(u), where z ~ N(0,I), u ~ χ²(1)
            z = rng.randn(D // 2, d)
            u = rng.chisquare(1, size=(D // 2, 1))
            self.omega = z / np.sqrt(u) / lengthscale
        else:
            raise ValueError(f"Unknown kernel type: {kernel_type}")

        # Sample phases
        self.b = rng.uniform(0, 2 * np.pi, size=D // 2)

    def transform(self, X: np.ndarray) -> np.ndarray:
        """
        Compute random features φ(X).

        Args:
            X: (N, d) input points

        Returns:
            Phi: (N, D) random feature matrix
        """
        N = len(X)
        D = self.D

        # Compute ω^T x + b for all points and frequencies
        projection = X @ self.omega.T + self.b  # (N, D/2)

        # Interleave cos and sin features
        Phi = np.zeros((N, D))
        Phi[:, 0::2] = np.cos(projection)  # Even indices
        Phi[:, 1::2] = np.sin(projection)  # Odd indices

        # Scale by sqrt(2/D) and kernel variance
        Phi *= np.sqrt(2.0 / D) * np.sqrt(self.variance)
        return Phi

    def kernel_approximation(self, X1: np.ndarray,
                             X2: np.ndarray = None) -> np.ndarray:
        """Compute approximate kernel matrix K̂ = Φ(X1) @ Φ(X2)^T."""
        Phi1 = self.transform(X1)
        if X2 is None:
            return Phi1 @ Phi1.T
        Phi2 = self.transform(X2)
        return Phi1 @ Phi2.T


def demonstrate_rff_approximation():
    """Show how the RFF approximation improves with D."""
    from numpy.linalg import norm
    import time

    np.random.seed(42)
    d = 5
    N = 200
    X = np.random.randn(N, d)

    # Exact SE kernel
    lengthscale = 1.0
    dist_sq = np.sum((X[:, None] - X[None, :])**2, axis=2)
    K_exact = np.exp(-0.5 * dist_sq / lengthscale**2)

    print("RFF Approximation Quality vs Number of Features")
    print("=" * 55)
    print(f"{'D':>8} | {'Relative Error':>15} | {'Approx Time':>12}")
    print("-" * 55)

    for D in [10, 50, 100, 500, 1000, 5000]:
        start = time.perf_counter()
        rff = RandomFourierFeatures(d=d, D=D, kernel_type='rbf',
                                    lengthscale=lengthscale, seed=42)
        K_approx = rff.kernel_approximation(X)
        elapsed = time.perf_counter() - start

        rel_error = norm(K_exact - K_approx, 'fro') / norm(K_exact, 'fro')
        print(f"{D:>8} | {rel_error:>15.6f} | {elapsed*1000:>10.2f} ms")


class RFFGaussianProcess:
    """
    Gaussian Process regression using Random Fourier Features.

    Converts GP inference to Bayesian linear regression in feature
    space, reducing complexity from O(N³) to O(ND² + D³).
    """

    def __init__(self, d: int, D: int, lengthscale: float = 1.0,
                 variance: float = 1.0, noise_variance: float = 0.01,
                 seed: int = None):
        self.rff = RandomFourierFeatures(d, D, 'rbf', lengthscale,
                                         variance, seed)
        self.noise_var = noise_variance
        self.D = D
        self.fitted = False

    def fit(self, X: np.ndarray, y: np.ndarray) -> 'RFFGaussianProcess':
        """
        Fit the RFF-GP model.

        This is Bayesian linear regression:
            p(w) = N(0, I),  y = Φw + ε
        Posterior: p(w|y) = N(m, S)
        """
        Phi = self.rff.transform(X)  # (N, D)

        # Posterior covariance: S = (Φ^T Φ / σ² + I)^{-1}
        A = Phi.T @ Phi / self.noise_var + np.eye(self.D)
        self.L = np.linalg.cholesky(A)

        # Posterior mean: m = S Φ^T y / σ²
        self.m = np.linalg.solve(
            self.L.T, np.linalg.solve(self.L, Phi.T @ y)) / self.noise_var
        self.fitted = True
        return self

    def predict(self, X_star: np.ndarray, return_var: bool = True):
        """Predictive distribution at test points."""
        if not self.fitted:
            raise RuntimeError("Model must be fitted before prediction")

        Phi_star = self.rff.transform(X_star)  # (N*, D)

        # Mean: μ* = Φ* m
        mu = Phi_star @ self.m
        if not return_var:
            return mu

        # Variance: σ²* = Φ* S Φ*^T + σ²_n (diagonal only).
        # S = L^{-T} L^{-1}, so the diagonal of Φ* S Φ*^T is
        # ||L^{-1} Φ*^T||² per column.
        v = np.linalg.solve(self.L, Phi_star.T)  # (D, N*)
        var = np.sum(v**2, axis=0) + self.noise_var
        return mu, var


if __name__ == "__main__":
    import time

    demonstrate_rff_approximation()

    print("\n" + "=" * 55)
    print("RFF-GP Regression Example")
    print("=" * 55)

    np.random.seed(42)
    N, d = 5000, 3
    X = np.random.randn(N, d) * 2
    y = np.sin(X[:, 0]) + 0.5 * X[:, 1]**2 + 0.1 * np.random.randn(N)

    # Train RFF-GP
    start = time.perf_counter()
    model = RFFGaussianProcess(d=d, D=500, lengthscale=1.0,
                               noise_variance=0.01)
    model.fit(X, y)
    train_time = time.perf_counter() - start

    # Predict
    X_test = np.random.randn(100, d)
    mu, var = model.predict(X_test)

    print(f"Training time (N={N}, D=500): {train_time:.3f}s")
    print(f"Prediction uncertainty: std ∈ [{np.sqrt(var).min():.3f}, "
          f"{np.sqrt(var).max():.3f}]")
```

One of RFF's attractive properties is formal guarantees on approximation quality. These bounds inform how many random features are needed for a given accuracy target.
Main Theorem (Rahimi & Recht, 2007):
For a bounded set $\mathcal{X}$ with diameter $L$, and D random Fourier features:
$$\mathbb{P}\left[\sup_{\mathbf{x}, \mathbf{x}' \in \mathcal{X}} |k(\mathbf{x}, \mathbf{x}') - \hat{k}(\mathbf{x}, \mathbf{x}')| \geq \epsilon\right] \leq 2^8 \left(\frac{\sigma_p L}{\epsilon}\right)^2 \exp\left(-\frac{D\epsilon^2}{4(d+2)}\right)$$
where $\sigma_p^2 = \mathbb{E}_{\omega \sim p}[\|\omega\|^2]$ is the second moment of the spectral density.
Interpretation:
To achieve uniform error $\epsilon$ with high probability $1 - \delta$:
$$D = O\left(\frac{d \log(L\sigma_p/\epsilon) + \log(1/\delta)}{\epsilon^2}\right)$$
The number of features scales linearly with the input dimension d, only logarithmically with the domain size L and the failure probability 1/δ, and as 1/ε² with the target accuracy ε. For typical applications, D = 500-5000 random features provide excellent approximation; the dimension d matters less than one might expect because of the logarithmic dependence on domain size. In practice, start with D = 1000 and increase if predictions are unsatisfactory.
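The bound can be inverted into a rough rule for choosing D. The helper below is an illustrative sketch (not a library function) that takes the theorem's constants literally; note how conservative the resulting D is compared to what works in practice:

```python
import numpy as np

def rff_feature_count(eps: float, delta: float, d: int,
                      sigma_p: float, L: float) -> int:
    """Rough D from the Rahimi-Recht uniform bound, constants taken literally.

    Solves 2^8 (sigma_p * L / eps)^2 * exp(-D eps^2 / (4(d+2))) <= delta for D.
    """
    log_term = np.log(2**8 * (sigma_p * L / eps)**2 / delta)
    return int(np.ceil(4 * (d + 2) / eps**2 * log_term))

# Example: d=5, SE kernel with unit length-scale (sigma_p = sqrt(d)),
# domain of diameter 10, uniform error 0.1 with probability 0.99
D_bound = rff_feature_count(eps=0.1, delta=0.01, d=5,
                            sigma_p=np.sqrt(5), L=10.0)
print(D_bound)  # tens of thousands: far more than the D ≈ 1000 that suffices in practice
```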
Variance of the estimator:
The RFF kernel estimate is unbiased: $\mathbb{E}[\hat{k}] = k$. Its variance is:
$$\text{Var}[\hat{k}(\mathbf{x}, \mathbf{x}')] = \frac{1}{D}\left(\mathbb{E}_{\omega}[|e^{i\omega^\top(\mathbf{x}-\mathbf{x}')}|^2] - |k(\mathbf{x}-\mathbf{x}')|^2\right) = \frac{1 - k^2(\mathbf{x}-\mathbf{x}')}{D}$$
The variance is largest when $k \approx 0$ (distant points) and vanishes as $k \to 1$ (nearby points). This is favorable: we get the best approximation where the kernel is strongest, and accept more variance where the kernel contribution is negligible.
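A quick simulation confirms this variance formula for the complex-exponential estimator (a sketch for the 1-D SE kernel with unit length-scale):

```python
import numpy as np

# Empirical check of Var[k̂] = (1 - k²)/D using the complex estimator
# k̂ = (1/D) Σ_j exp(i ω_j τ), with ω ~ N(0, 1) for the SE kernel.
rng = np.random.default_rng(1)
D, trials, tau = 100, 20_000, 1.0
k = np.exp(-0.5 * tau**2)  # exact kernel value at separation τ

omega = rng.standard_normal((trials, D))
k_hat = np.mean(np.exp(1j * omega * tau), axis=1)  # one estimate per trial

empirical = np.mean(np.abs(k_hat - k) ** 2)
predicted = (1 - k**2) / D
print(empirical, predicted)  # should agree within a few percent
```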
Concentration for learning:
For supervised learning, what matters is not kernel approximation per se, but whether predictions are accurate. Tighter analyses show that generalization error degrades gracefully with kernel approximation error, with the effective sample complexity increasing by approximately $(1 + \epsilon)$ where $\epsilon$ is the kernel approximation error.
The basic RFF has been extended in numerous ways to improve approximation quality, reduce variance, and handle more general kernels:
1. Orthogonal Random Features (ORF):
Yu et al. (2016) showed that orthogonalizing the random frequency matrix $\Omega$ reduces variance. Instead of sampling independently:
Standard RFF: Ω = [ω_1, ..., ω_{D/2}]^T, each ω_i ~ p(ω)
Orthogonal RF: Ω = S Q, where Q is a random orthogonal matrix and S is a diagonal rescaling giving each row a χ-distributed norm (matching ||ω|| under p(ω))
ORF achieves the same expected value with lower variance, requiring fewer features for the same accuracy.
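A minimal sketch of the ORF construction (the `orf_frequencies` helper is illustrative, not a library API): orthogonalize Gaussian blocks via QR, then rescale rows to χ-distributed norms so each frequency is still marginally Gaussian:

```python
import numpy as np

def orf_frequencies(n_freq: int, d: int, rng) -> np.ndarray:
    """Orthogonal random frequencies for the SE kernel (unit length-scale)."""
    blocks, remaining = [], n_freq
    while remaining > 0:
        G = rng.standard_normal((d, d))
        Q, _ = np.linalg.qr(G)                      # random orthogonal d×d
        norms = np.sqrt(rng.chisquare(d, size=d))   # ||ω|| ~ χ(d), as for N(0, I)
        blocks.append((Q * norms[:, None])[:min(remaining, d)])
        remaining -= d
    return np.vstack(blocks)

rng = np.random.default_rng(0)
omega = orf_frequencies(n_freq=128, d=4, rng=rng)
print(omega.shape)  # (128, 4); rows within each block of 4 are exactly orthogonal
```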
2. Quadrature Fourier Features (QFF):
Mutny & Krause (2018) replace Monte Carlo sampling with deterministic quadrature rules. For product spectral densities:
$$p(\omega) = \prod_{j=1}^d p_j(\omega_j)$$
Quadrature nodes provide "optimal" samples that minimize integration error. QFF is especially effective in low-to-moderate dimensions (d ≤ 10).
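For intuition, here is a sketch of quadrature features for the 1-D SE kernel using Gauss-Hermite nodes in place of Monte Carlo samples; with the substitution $\omega = \sqrt{2}x/\ell$, the Bochner integral becomes a Gauss-Hermite integral:

```python
import numpy as np

# QFF sketch for the 1-D SE kernel:
#   k(τ) = E_{ω ~ N(0, ℓ⁻²)}[cos(ωτ)] ≈ (1/√π) Σ_i w_i cos(√2 x_i τ / ℓ)
# where (x_i, w_i) are Gauss-Hermite nodes and weights.
nodes, weights = np.polynomial.hermite.hermgauss(30)  # 30-point rule
ell = 1.0
omega = np.sqrt(2) * nodes / ell

tau = np.linspace(-3, 3, 61)
k_qff = (weights / np.sqrt(np.pi)) @ np.cos(np.outer(omega, tau))
k_exact = np.exp(-0.5 * tau**2 / ell**2)
print(np.max(np.abs(k_qff - k_exact)))  # tiny: far below typical Monte Carlo error
```

With only 30 deterministic nodes the error is negligible, whereas Monte Carlo sampling would need orders of magnitude more features; this advantage evaporates in high dimensions because tensor-product grids grow exponentially in d.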
| Method | Variance | Complexity | Best For |
|---|---|---|---|
| Standard RFF | O(1/D) | O(NDd) | General purpose, simple implementation |
| Orthogonal RF (ORF) | O(1/D), smaller constant | O(NDd + Dd²) | When variance reduction is critical |
| Quadrature FF (QFF) | Deterministic; error decays exponentially in D^{1/d} | O(NDd) | Low-dimensional inputs (d ≤ 5) |
| Structured ORF (SORF) | O(1/D), smaller constant | O(ND log d) | Large D with structured matrices |
3. Structured Orthogonal Random Features (SORF):
Building on Fastfood (Le et al., 2013), SORF uses structured matrices (Hadamard, diagonal) to implement the orthogonal projection efficiently:
$$\Phi(\mathbf{x}) = \sqrt{\frac{1}{D}} \cdot \left[\begin{matrix} \cos(\Omega \mathbf{x}) \\ \sin(\Omega \mathbf{x}) \end{matrix}\right], \quad \Omega = \frac{1}{\sigma}S H \mathbf{D}_{\pm}$$
where H is a Hadamard matrix, $\mathbf{D}_{\pm}$ is a random diagonal matrix of ±1 entries (written this way to avoid clashing with the feature count D), and S is a diagonal scaling. The matrix-vector product $\Omega \mathbf{x}$ can be computed in O(d log d) per block via the fast Walsh-Hadamard transform, giving total complexity O(ND log d) instead of O(NDd).
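The key computational trick is that multiplying by a Hadamard matrix never requires materializing it. The sketch below (the `fwht` helper is illustrative) checks a fast Walsh-Hadamard transform, applied after a random ±1 sign flip, against the explicit matrix product:

```python
import numpy as np

def fwht(x: np.ndarray) -> np.ndarray:
    """Fast Walsh-Hadamard transform (unnormalized), O(d log d)."""
    x = x.copy()
    h = 1
    while h < len(x):
        for i in range(0, len(x), 2 * h):
            a, b = x[i:i + h].copy(), x[i + h:i + 2 * h].copy()
            x[i:i + h], x[i + h:i + 2 * h] = a + b, a - b
        h *= 2
    return x

rng = np.random.default_rng(0)
d = 8                                    # must be a power of 2
signs = rng.choice([-1.0, 1.0], size=d)  # random ±1 diagonal
x = rng.standard_normal(d)

fast = fwht(signs * x)  # H applied without building the d×d matrix

# Explicit Hadamard matrix for comparison (Sylvester construction)
H = np.array([[1.0]])
while H.shape[0] < d:
    H = np.block([[H, H], [H, -H]])
explicit = H @ (signs * x)
print(np.allclose(fast, explicit))  # True
```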
4. Generalized Spectral Kernels:
For non-stationary or structured kernels, alternative feature constructions exist:
Recent work (Wilson & Adams, 2013; Samo & Roberts, 2015) learns the spectral density from data, combining the flexibility of non-parametric spectral mixture kernels with the efficiency of random features. This enables adapting to complex periodic patterns and multi-scale structure while maintaining computational tractability.
With an explicit feature map $\phi: \mathbb{R}^d \to \mathbb{R}^D$, Gaussian Process regression transforms into Bayesian linear regression:
Model: $$f(\mathbf{x}) = \phi(\mathbf{x})^\top \mathbf{w}, \quad \mathbf{w} \sim \mathcal{N}(\mathbf{0}, I_D)$$ $$y = f(\mathbf{x}) + \epsilon, \quad \epsilon \sim \mathcal{N}(0, \sigma_n^2)$$
The prior over functions induced by this model has covariance: $$\text{Cov}[f(\mathbf{x}), f(\mathbf{x}')] = \phi(\mathbf{x})^\top \phi(\mathbf{x}') = \hat{k}(\mathbf{x}, \mathbf{x}')$$
which approximates the target kernel.
Posterior inference:
Given data $(X, \mathbf{y})$ with feature matrix $\Phi = [\phi(\mathbf{x}_1), ..., \phi(\mathbf{x}_N)]^\top \in \mathbb{R}^{N \times D}$:
$$p(\mathbf{w}|\mathbf{y}) = \mathcal{N}(\mathbf{m}_w, S_w)$$ $$S_w = (\sigma_n^{-2}\Phi^\top\Phi + I_D)^{-1}$$ $$\mathbf{m}_w = \sigma_n^{-2}S_w\Phi^\top\mathbf{y}$$
Predictive distribution: $$p(f_*|\mathbf{y}) = \mathcal{N}(\phi_*^\top\mathbf{m}_w,\; \phi_*^\top S_w \phi_*)$$ where $\phi_* = \phi(\mathbf{x}_*)$.
Computational complexity:
| Operation | Exact GP | RFF-GP |
|---|---|---|
| Training | O(N³) | O(ND² + D³) |
| Prediction (mean) | O(N) | O(D) |
| Prediction (var) | O(N²) | O(D²) |
| Memory | O(N²) | O(ND + D²) |
For D << N, RFF-GP is dramatically faster. With D = 1000 and N = 100,000, training costs roughly ND² + D³ ≈ 10¹¹ operations versus N³ = 10¹⁵ for the exact GP, a reduction of about four orders of magnitude; memory drops from 80 GB for the full kernel matrix to under 1 GB for the feature matrix.
When to use RFF-GP vs Sparse GP:
RFF-GP is not a panacea. It only works for stationary kernels, it cannot adapt feature locations to data (unlike learned inducing points), and the approximation quality is uniform rather than data-adaptive. For highly non-uniform data or when predictive uncertainty near specific regions is critical, variational sparse GPs may be superior.
Learning kernel hyperparameters (length-scale, variance) in the RFF framework requires care because the random features depend on these parameters.
The challenge:
In standard RFF, frequencies $\omega_j$ are sampled from $p(\omega; \theta)$ where $\theta$ includes length-scales. Changing $\theta$ changes the spectral density, requiring new samples—but we want to optimize $\theta$ via gradient descent.
Solution 1: Reparameterization
For the SE kernel with length-scale $\ell$, sample base frequencies from $\mathcal{N}(0, I)$ and scale:
omega_base = np.random.randn(D//2, d) # Sample once
omega = omega_base / lengthscale # Scale by current hyperparameter
Now $\omega$ is differentiable with respect to $\ell$, enabling gradient-based optimization.
Solution 2: Implicit reparameterization
For more complex spectral densities, use the reparameterization trick: $$\omega = T(\epsilon; \theta), \quad \epsilon \sim p_0(\epsilon)$$
where $T$ is a differentiable transformation and $p_0$ is a fixed base distribution.
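A tiny sketch of this idea for the Matérn-1/2 spectral density (names here are illustrative): the base draws are sampled once and fixed, and the transformed frequencies remain differentiable in the length-scale:

```python
import torch

# Implicit reparameterization sketch for Matérn-1/2 frequencies.
# Base draws z ~ N(0, I) and u ~ χ²(1) are sampled ONCE and fixed;
# the transformation ω = z / sqrt(u) / ℓ is then differentiable in ℓ.
torch.manual_seed(0)
d, D_half = 2, 64
z = torch.randn(D_half, d)                              # fixed base normals
u = torch.distributions.Chi2(df=1).sample((D_half, 1))  # fixed base χ² draws

log_ell = torch.tensor(0.0, requires_grad=True)  # log length-scale
ell = torch.exp(log_ell)
omega = z / torch.sqrt(u) / ell                  # ω = T(ε; ℓ)

# Any loss built from ω now backpropagates to the length-scale
loss = torch.sum(omega ** 2)
loss.backward()
print(log_ell.grad)  # nonzero gradient
```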
```python
import numpy as np
import torch
import torch.nn as nn
from torch.optim import Adam


class LearnableRFFGP(nn.Module):
    """
    RFF-GP with learnable hyperparameters via reparameterization.

    Key insight: sample base frequencies ω_0 ~ N(0, I) once, then
    scale by the lengthscale: ω = ω_0 / ℓ.
    This makes the likelihood differentiable w.r.t. ℓ.
    """

    def __init__(self, d: int, D: int, seed: int = None):
        super().__init__()
        self.d = d
        self.D = D

        # Fixed base frequencies and phases (sampled once)
        rng = np.random.RandomState(seed)
        omega_base = rng.randn(D // 2, d)
        b = rng.uniform(0, 2 * np.pi, size=D // 2)
        self.register_buffer('omega_base',
                             torch.tensor(omega_base, dtype=torch.float64))
        self.register_buffer('b', torch.tensor(b, dtype=torch.float64))

        # Learnable hyperparameters (in log space)
        self.log_lengthscale = nn.Parameter(torch.zeros(d, dtype=torch.float64))
        self.log_variance = nn.Parameter(torch.tensor(0.0, dtype=torch.float64))
        self.log_noise = nn.Parameter(torch.tensor(-2.0, dtype=torch.float64))

    @property
    def lengthscales(self):
        return torch.exp(self.log_lengthscale)

    @property
    def variance(self):
        return torch.exp(self.log_variance)

    @property
    def noise_variance(self):
        return torch.exp(self.log_noise)

    def features(self, X: torch.Tensor) -> torch.Tensor:
        """Compute RFF features with the current hyperparameters."""
        # Scale frequencies by lengthscales (ARD)
        omega = self.omega_base / self.lengthscales  # (D/2, d)

        # Projections
        proj = X @ omega.T + self.b  # (N, D/2)

        # Features
        Phi = torch.zeros(X.shape[0], self.D, dtype=torch.float64)
        Phi[:, 0::2] = torch.cos(proj)
        Phi[:, 1::2] = torch.sin(proj)

        # Scale by sqrt(2σ²/D)
        Phi = Phi * torch.sqrt(2 * self.variance / self.D)
        return Phi

    def log_marginal_likelihood(self, X: torch.Tensor,
                                y: torch.Tensor) -> torch.Tensor:
        """
        Compute log p(y | X, θ) for hyperparameter optimization.

        Uses the Bayesian linear regression marginal likelihood:
            log p(y) = log N(y; 0, Φ Φ^T + σ²I)
        computed efficiently via the Woodbury identity.
        """
        Phi = self.features(X)  # (N, D)
        N, D = Phi.shape

        # Cholesky of A = Φ^T Φ + σ²I (D×D matrix)
        A = Phi.T @ Phi + self.noise_variance * torch.eye(D, dtype=torch.float64)
        L_A = torch.linalg.cholesky(A)

        # log|Φ Φ^T + σ²I| = log|A| + (N-D) log(σ²)   [determinant lemma]
        log_det_A = 2 * torch.sum(torch.log(torch.diag(L_A)))
        log_det = log_det_A + (N - D) * self.log_noise

        # y^T (Φ Φ^T + σ²I)^{-1} y via Woodbury:
        # = (y^T y - y^T Φ A^{-1} Φ^T y) / σ²
        y_Phi = y @ Phi  # (D,)
        A_inv_Phi_y = torch.linalg.solve(L_A.T, torch.linalg.solve(L_A, y_Phi))
        quad = (torch.dot(y, y)
                - torch.dot(y_Phi, A_inv_Phi_y)) / self.noise_variance

        lml = -0.5 * quad - 0.5 * log_det - 0.5 * N * np.log(2 * np.pi)
        return lml

    def fit(self, X: np.ndarray, y: np.ndarray,
            n_iterations: int = 200, lr: float = 0.05):
        """Optimize hyperparameters via the marginal likelihood."""
        X_t = torch.tensor(X, dtype=torch.float64)
        y_t = torch.tensor(y, dtype=torch.float64)

        optimizer = Adam(self.parameters(), lr=lr)

        for i in range(n_iterations):
            optimizer.zero_grad()
            lml = self.log_marginal_likelihood(X_t, y_t)
            (-lml).backward()  # Maximize LML
            optimizer.step()

            if (i + 1) % 50 == 0:
                ls = self.lengthscales.detach().numpy()
                print(f"Iter {i+1}: LML={lml.item():.2f}, "
                      f"lengthscales={ls[:min(3, len(ls))]}, "
                      f"noise={self.noise_variance.item():.4f}")
        return self

    def predict(self, X_train: torch.Tensor, y_train: torch.Tensor,
                X_test: torch.Tensor) -> tuple:
        """Compute the predictive distribution at test points."""
        with torch.no_grad():
            Phi_train = self.features(X_train)
            Phi_test = self.features(X_test)

            # Posterior over weights
            A = (Phi_train.T @ Phi_train / self.noise_variance
                 + torch.eye(self.D, dtype=torch.float64))
            L = torch.linalg.cholesky(A)
            m = torch.linalg.solve(
                L.T, torch.linalg.solve(L, Phi_train.T @ y_train)
            ) / self.noise_variance

            # Predictive moments
            mu = Phi_test @ m
            v = torch.linalg.solve(L, Phi_test.T)
            var = torch.sum(v**2, dim=0) + self.noise_variance

        return mu, var


# Example: learning hyperparameters
if __name__ == "__main__":
    np.random.seed(42)
    torch.manual_seed(42)

    # Generate data with known structure
    N = 1000
    d = 3
    true_lengthscales = np.array([0.5, 1.0, 2.0])  # Different per dimension
    X = np.random.randn(N, d)
    # Function with ARD structure: first dimension matters most
    y = (3 * np.sin(X[:, 0] / 0.5) + 0.5 * np.cos(X[:, 1] / 1.0)
         + 0.1 * np.random.randn(N))

    print("Learning Hyperparameters with RFF-GP")
    print("=" * 50)
    print(f"True lengthscales (approx): {true_lengthscales}")
    print()

    model = LearnableRFFGP(d=d, D=500, seed=42)
    model.fit(X, y, n_iterations=150, lr=0.05)

    print(f"Learned lengthscales: {model.lengthscales.detach().numpy()}")
    print(f"Learned noise var: {model.noise_variance.item():.4f}")
```

Both random features and sparse GPs provide O(NM²) or O(ND²) complexity, but they achieve this through fundamentally different mechanisms with distinct trade-offs:
| Aspect | Random Features (RFF) | Sparse GPs (VFE/SVGP) |
|---|---|---|
| Approximation target | Kernel function globally | GP posterior locally |
| Kernel types | Stationary only | Any positive-definite |
| Adaptivity | Uniform across input space | Concentrated near data |
| Hyperparameter learning | Via marginal likelihood | Via ELBO |
| Uncertainty | May underestimate in sparse regions | Better near inducing points |
| Implementation | Simpler (linear algebra) | More complex (Cholesky, etc.) |
| Prediction cost | O(D) per point | O(M) per point |
| Minibatch training | Straightforward | Requires variational formulation |
| Memory | O(D²) for posterior | O(M²) for K_uu inverse |
When to choose Random Features: the kernel is stationary, implementation simplicity and very fast O(D) prediction matter, or you want straightforward minibatch training without variational machinery.
When to choose Sparse GPs: the kernel is non-stationary, the data are highly non-uniform, or well-calibrated predictive uncertainty near specific regions is critical.
Recent work combines RFF with sparse GPs: use random features as the base kernel, then apply sparse methods on top. This provides the best of both worlds—global approximation quality from RFF and local adaptation from inducing points. Libraries like GPyTorch support these composable scalability approaches.
We have developed the random feature framework for scalable Gaussian Processes—an elegant approach that approximates the kernel itself rather than the GP structure. The key insights from this module's final page:
Module conclusion:
This module has comprehensively covered scalable Gaussian Process methods—from understanding why exact GPs don't scale, through sparse approximations and inducing point selection, to the principled variational framework and random feature alternatives. You now possess the full toolkit for deploying Gaussian Processes at scale:
With these techniques, Gaussian Processes transform from an elegant but impractical method into a production-ready tool for large-scale probabilistic machine learning.
Congratulations! You have completed the Scalable GPs module. You now understand the computational barriers of exact GP inference, the theory and practice of sparse approximations, variational inference for scalability, and random feature methods. These techniques power real-world GP deployments at companies like Uber, Google, and Amazon, handling predictions at scales that would have seemed impossible a decade ago.