The scalable GP methods we have explored so far—sparse approximations and variational inference—reduce complexity by compressing the GP through a small set of inducing points. Random feature methods take a fundamentally different approach: rather than approximating the GP structure, they approximate the kernel function itself.
The core insight, formalized in Rahimi and Recht's seminal 2007 paper, is that many kernel functions can be written as expectations over explicit feature maps:
$$k(\mathbf{x}, \mathbf{x}') = \mathbb{E}_{\omega \sim p(\omega)}[\phi_\omega(\mathbf{x})^\top \phi_\omega(\mathbf{x}')]$$
By sampling D random "frequencies" $\{\omega_1, ..., \omega_D\}$ from the distribution $p(\omega)$, we obtain a Monte Carlo approximation:
$$k(\mathbf{x}, \mathbf{x}') \approx \hat{\phi}(\mathbf{x})^\top \hat{\phi}(\mathbf{x}')$$
where $\hat{\phi}(\mathbf{x}) \in \mathbb{R}^D$ is an explicit, low-dimensional feature vector.
This transformation is revolutionary: it converts kernel methods, whose complexity scales with data size, into standard linear methods operating on D-dimensional features. GP inference becomes linear regression in feature space, reducing O(N³) to O(ND² + D³)—a remarkable improvement when D << N.
By the end of this page, you will understand how Bochner's theorem provides the foundation for random features, master Random Fourier Features (RFF) for stationary kernels, explore advanced variants such as Orthogonal Random Features and Quadrature Fourier Features, and appreciate both the power and the limitations of this approach to GP scalability.
The mathematical foundation for random features comes from Bochner's theorem, a foundational result in harmonic analysis that characterizes positive-definite functions.
Bochner's Theorem (1933):
A continuous, stationary kernel $k(\mathbf{x} - \mathbf{x}') = k(\boldsymbol{\tau})$ is positive-definite if and only if it is the Fourier transform of a non-negative measure $p(\boldsymbol{\omega})$:
$$k(\boldsymbol{\tau}) = \int_{\mathbb{R}^d} e^{i \boldsymbol{\omega}^\top \boldsymbol{\tau}} p(\boldsymbol{\omega}) d\boldsymbol{\omega}$$
When $k(\mathbf{0}) = 1$ (normalized kernel), $p(\boldsymbol{\omega})$ is a proper probability distribution called the spectral density of the kernel.
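As a concrete example, the SE kernel and the Gaussian density form a Fourier pair, which identifies its spectral density directly:

$$k_{\text{SE}}(\boldsymbol{\tau}) = \exp\left(-\frac{\|\boldsymbol{\tau}\|^2}{2\ell^2}\right) = \int_{\mathbb{R}^d} e^{i\boldsymbol{\omega}^\top \boldsymbol{\tau}} \left(\frac{\ell^2}{2\pi}\right)^{d/2} \exp\left(-\frac{\ell^2\|\boldsymbol{\omega}\|^2}{2}\right) d\boldsymbol{\omega}$$

so sampling frequencies for the SE kernel simply means drawing $\boldsymbol{\omega} \sim \mathcal{N}(\mathbf{0}, \ell^{-2}I)$.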
Key insight:
The kernel can be written as an expectation:
$$k(\mathbf{x} - \mathbf{x}') = \mathbb{E}_{\boldsymbol{\omega} \sim p}\left[e^{i \boldsymbol{\omega}^\top (\mathbf{x} - \mathbf{x}')}\right] = \mathbb{E}_{\boldsymbol{\omega} \sim p}\left[e^{i \boldsymbol{\omega}^\top \mathbf{x}}\overline{e^{i \boldsymbol{\omega}^\top \mathbf{x}'}}\right]$$
where $\overline{\cdot}$ denotes complex conjugate.
For real-valued kernels, we can use the real part of the exponential:
$$k(\mathbf{x} - \mathbf{x}') = \mathbb{E}_{\boldsymbol{\omega}, b}\left[\sqrt{2}\cos(\boldsymbol{\omega}^\top \mathbf{x} + b) \cdot \sqrt{2}\cos(\boldsymbol{\omega}^\top \mathbf{x}' + b)\right]$$
where $b \sim \text{Uniform}[0, 2\pi)$.
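This expectation is easy to verify numerically. The following quick check (a sketch for the 1-D SE kernel with unit length-scale) Monte Carlo-estimates the cosine expectation and compares it against the closed-form kernel:

```python
import numpy as np

# Empirical check of the cosine identity behind Bochner's theorem: for the
# SE kernel in d=1 with lengthscale 1, E[2 cos(ωx+b) cos(ωx'+b)] should
# recover k(x - x') = exp(-(x - x')² / 2).
rng = np.random.default_rng(0)
n_samples = 200_000
x, x_prime = 0.3, 1.1

omega = rng.standard_normal(n_samples)    # ω ~ N(0, 1), the SE spectral density
b = rng.uniform(0, 2 * np.pi, n_samples)  # b ~ Uniform[0, 2π)

estimate = np.mean(2 * np.cos(omega * x + b) * np.cos(omega * x_prime + b))
exact = np.exp(-0.5 * (x - x_prime) ** 2)
print(estimate, exact)  # agree to roughly two decimal places
```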
| Kernel | k(τ) | Spectral Density p(ω) |
|---|---|---|
| Squared Exponential (RBF) | exp(-||τ||²/2ℓ²) | N(0, ℓ⁻²I) — Gaussian |
| Matérn (ν=1/2, Exp) | exp(-||τ||/ℓ) | Multivariate Cauchy: ∝ (1 + ℓ²||ω||²)^(-(d+1)/2) |
| Matérn (ν=3/2) | (1 + √3||τ||/ℓ)exp(-√3||τ||/ℓ) | Student-t-like |
| Matérn (ν→∞) | exp(-||τ||²/2ℓ²) | Gaussian (same as SE) |
| Periodic | exp(-2sin²(π||τ||/p)/ℓ²) | Bessel functions |
Bochner's theorem applies only to stationary (translation-invariant) kernels where k(x, x') = k(x - x'). Non-stationary kernels like the linear kernel or dot-product kernels require different techniques. However, many practical applications use stationary kernels, making random features broadly applicable.
Random Fourier Features (RFF) directly implement the Monte Carlo approximation to Bochner's expectation:
Algorithm:
Random Fourier Features Algorithm
Input: Kernel k with spectral density p(ω), number of features D
Output: Feature map φ: R^d → R^D
1. Sample {ω_1, ..., ω_{D/2}} ~ p(ω)
2. Sample {b_1, ..., b_{D/2}} ~ Uniform[0, 2π)
3. Define feature map:
φ(x) = sqrt(2/D) * [cos(ω_1^T x + b_1), sin(ω_1^T x + b_1),
cos(ω_2^T x + b_2), sin(ω_2^T x + b_2),
...,
cos(ω_{D/2}^T x + b_{D/2}), sin(ω_{D/2}^T x + b_{D/2})]
4. Return φ
The kernel is then approximated as: $$\hat{k}(\mathbf{x}, \mathbf{x}') = \phi(\mathbf{x})^\top \phi(\mathbf{x}')$$
For the Squared Exponential kernel with length-scale $\ell$:
```python
import numpy as np


class RandomFourierFeatures:
    """
    Random Fourier Features for kernel approximation.

    Implements the Rahimi & Recht (2007) algorithm for approximating
    stationary kernels via random sampling from the spectral density.
    """

    def __init__(self, d: int, D: int, kernel_type: str = 'rbf',
                 lengthscale: float = 1.0, variance: float = 1.0,
                 seed: int = None):
        """
        Args:
            d: Input dimensionality
            D: Number of random features (must be even for sin+cos)
            kernel_type: 'rbf' (squared exponential) or 'matern12' (exponential)
            lengthscale: Kernel length-scale parameter
            variance: Kernel variance (output scale)
            seed: Random seed for reproducibility
        """
        self.d = d
        self.D = D
        self.kernel_type = kernel_type
        self.lengthscale = lengthscale
        self.variance = variance

        rng = np.random.RandomState(seed)

        # Sample frequencies from the spectral density
        if kernel_type == 'rbf':
            # SE kernel: p(ω) = N(0, ℓ^{-2}I)
            self.omega = rng.randn(D // 2, d) / lengthscale
        elif kernel_type == 'matern12':
            # Exponential kernel: p(ω) ∝ (1 + ℓ²||ω||²)^{-(d+1)/2}
            # This is a multivariate Cauchy (Student-t with ν=1).
            # Sample via: ω = z / sqrt(u), where z ~ N(0,I), u ~ χ²(1)
            z = rng.randn(D // 2, d)
            u = rng.chisquare(1, size=(D // 2, 1))
            self.omega = z / np.sqrt(u) / lengthscale
        else:
            raise ValueError(f"Unknown kernel type: {kernel_type}")

        # Sample phases
        self.b = rng.uniform(0, 2 * np.pi, size=D // 2)

    def transform(self, X: np.ndarray) -> np.ndarray:
        """
        Compute random features φ(X).

        Args:
            X: (N, d) input points

        Returns:
            Phi: (N, D) random feature matrix
        """
        N = len(X)
        D = self.D

        # Compute ω^T x + b for all points and frequencies
        projection = X @ self.omega.T + self.b  # (N, D/2)

        # Interleave cos and sin features
        Phi = np.zeros((N, D))
        Phi[:, 0::2] = np.cos(projection)  # Even indices
        Phi[:, 1::2] = np.sin(projection)  # Odd indices

        # Scale by sqrt(2/D) and kernel variance
        Phi *= np.sqrt(2.0 / D) * np.sqrt(self.variance)
        return Phi

    def kernel_approximation(self, X1: np.ndarray,
                             X2: np.ndarray = None) -> np.ndarray:
        """Compute approximate kernel matrix K̂ = Φ(X1) @ Φ(X2)^T."""
        Phi1 = self.transform(X1)
        if X2 is None:
            return Phi1 @ Phi1.T
        Phi2 = self.transform(X2)
        return Phi1 @ Phi2.T


def demonstrate_rff_approximation():
    """Show how the RFF approximation improves with D."""
    from numpy.linalg import norm
    import time

    np.random.seed(42)
    d = 5
    N = 200
    X = np.random.randn(N, d)

    # Exact SE kernel
    lengthscale = 1.0
    dist_sq = np.sum((X[:, None] - X[None, :])**2, axis=2)
    K_exact = np.exp(-0.5 * dist_sq / lengthscale**2)

    print("RFF Approximation Quality vs Number of Features")
    print("=" * 55)
    print(f"{'D':>8} | {'Relative Error':>15} | {'Approx Time':>12}")
    print("-" * 55)

    for D in [10, 50, 100, 500, 1000, 5000]:
        start = time.perf_counter()
        rff = RandomFourierFeatures(d=d, D=D, kernel_type='rbf',
                                    lengthscale=lengthscale, seed=42)
        K_approx = rff.kernel_approximation(X)
        elapsed = time.perf_counter() - start

        rel_error = norm(K_exact - K_approx, 'fro') / norm(K_exact, 'fro')
        print(f"{D:>8} | {rel_error:>15.6f} | {elapsed*1000:>10.2f} ms")


class RFFGaussianProcess:
    """
    Gaussian Process regression using Random Fourier Features.

    Converts GP inference to Bayesian linear regression in feature
    space, reducing complexity from O(N³) to O(ND² + D³).
    """

    def __init__(self, d: int, D: int, lengthscale: float = 1.0,
                 variance: float = 1.0, noise_variance: float = 0.01,
                 seed: int = None):
        self.rff = RandomFourierFeatures(d, D, 'rbf', lengthscale,
                                         variance, seed)
        self.noise_var = noise_variance
        self.D = D
        self.fitted = False

    def fit(self, X: np.ndarray, y: np.ndarray) -> 'RFFGaussianProcess':
        """
        Fit the RFF-GP model.

        This is Bayesian linear regression:
            p(w) = N(0, I),  y = Φw + ε
        Posterior: p(w|y) = N(m, S)
        """
        Phi = self.rff.transform(X)  # (N, D)

        # Posterior covariance: S = (Φ^T Φ / σ² + I)^{-1}
        A = Phi.T @ Phi / self.noise_var + np.eye(self.D)
        self.L = np.linalg.cholesky(A)

        # Posterior mean: m = S Φ^T y / σ²
        self.m = np.linalg.solve(
            self.L.T, np.linalg.solve(self.L, Phi.T @ y)) / self.noise_var
        self.fitted = True
        return self

    def predict(self, X_star: np.ndarray, return_var: bool = True):
        """Predictive distribution at test points."""
        if not self.fitted:
            raise RuntimeError("Model must be fitted before prediction")

        Phi_star = self.rff.transform(X_star)  # (N*, D)

        # Mean: μ* = Φ* m
        mu = Phi_star @ self.m
        if not return_var:
            return mu

        # Variance: σ²* = Φ* S Φ*^T + σ²_n (diagonal only).
        # S = L^{-T} L^{-1}, so the diagonal of Φ* S Φ*^T is
        # ||L^{-1} Φ*^T||² per column.
        v = np.linalg.solve(self.L, Phi_star.T)  # (D, N*)
        var = np.sum(v**2, axis=0) + self.noise_var
        return mu, var


if __name__ == "__main__":
    import time

    demonstrate_rff_approximation()

    print("\n" + "=" * 55)
    print("RFF-GP Regression Example")
    print("=" * 55)

    np.random.seed(42)
    N, d = 5000, 3
    X = np.random.randn(N, d) * 2
    y = np.sin(X[:, 0]) + 0.5 * X[:, 1]**2 + 0.1 * np.random.randn(N)

    # Train RFF-GP
    start = time.perf_counter()
    model = RFFGaussianProcess(d=d, D=500, lengthscale=1.0,
                               noise_variance=0.01)
    model.fit(X, y)
    train_time = time.perf_counter() - start

    # Predict
    X_test = np.random.randn(100, d)
    mu, var = model.predict(X_test)

    print(f"Training time (N={N}, D=500): {train_time:.3f}s")
    print(f"Prediction uncertainty: std ∈ [{np.sqrt(var).min():.3f}, "
          f"{np.sqrt(var).max():.3f}]")
```

One of RFF's attractive properties is formal guarantees on approximation quality. These bounds inform how many random features are needed for a given accuracy target.
Main Theorem (Rahimi & Recht, 2007):
For a bounded set $\mathcal{X}$ with diameter $L$, and D random Fourier features:
$$\mathbb{P}\left[\sup_{\mathbf{x}, \mathbf{x}' \in \mathcal{X}} |k(\mathbf{x}, \mathbf{x}') - \hat{k}(\mathbf{x}, \mathbf{x}')| \geq \epsilon\right] \leq 2^8 \left(\frac{\sigma_p L}{\epsilon}\right)^2 \exp\left(-\frac{D\epsilon^2}{4(d+2)}\right)$$
where $\sigma_p^2 = \mathbb{E}_{\omega \sim p}[\|\omega\|^2]$ is the second moment of the spectral density.
Interpretation:
To achieve uniform error $\epsilon$ with high probability $1 - \delta$:
$$D = O\left(\frac{d \log(L\sigma_p/\epsilon) + \log(1/\delta)}{\epsilon^2}\right)$$
The number of features scales linearly with the input dimension d, only logarithmically with the domain size L and the failure probability 1/δ, and as 1/ε² with the target accuracy ε. For typical applications, D = 500-5000 random features provide excellent approximation; the dimension d matters less than one might expect because of the logarithmic dependence on domain size. In practice, start with D = 1000 and increase if predictions are unsatisfactory.
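The bound can be inverted into a rough rule for choosing D. The helper below is an illustrative sketch (not a library function) that takes the theorem's constants literally; note how conservative the resulting D is compared to what works in practice:

```python
import numpy as np

def rff_feature_count(eps: float, delta: float, d: int,
                      sigma_p: float, L: float) -> int:
    """Rough D from the Rahimi-Recht uniform bound, constants taken literally.

    Solves 2^8 (sigma_p * L / eps)^2 * exp(-D eps^2 / (4(d+2))) <= delta for D.
    """
    log_term = np.log(2**8 * (sigma_p * L / eps)**2 / delta)
    return int(np.ceil(4 * (d + 2) / eps**2 * log_term))

# Example: d=5, SE kernel with unit length-scale (sigma_p = sqrt(d)),
# domain of diameter 10, uniform error 0.1 with probability 0.99
D_bound = rff_feature_count(eps=0.1, delta=0.01, d=5,
                            sigma_p=np.sqrt(5), L=10.0)
print(D_bound)  # tens of thousands: far more than the D ≈ 1000 that suffices in practice
```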
Variance of the estimator:
The RFF kernel estimate is unbiased: $\mathbb{E}[\hat{k}] = k$. Its variance is:
$$\text{Var}[\hat{k}(\mathbf{x}, \mathbf{x}')] = \frac{1}{D}\left(\mathbb{E}_{\omega}[|e^{i\omega^\top(\mathbf{x}-\mathbf{x}')}|^2] - |k(\mathbf{x}-\mathbf{x}')|^2\right) = \frac{1 - k^2(\mathbf{x}-\mathbf{x}')}{D}$$
The variance is largest when $k \approx 0$ (distant points) and vanishes as $k \to 1$ (nearby points). This is favorable: we get the best approximation where the kernel is strongest, and accept more variance where the kernel contribution is negligible.
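A quick simulation confirms this variance formula for the complex-exponential estimator (a sketch for the 1-D SE kernel with unit length-scale):

```python
import numpy as np

# Empirical check of Var[k̂] = (1 - k²)/D using the complex estimator
# k̂ = (1/D) Σ_j exp(i ω_j τ), with ω ~ N(0, 1) for the SE kernel.
rng = np.random.default_rng(1)
D, trials, tau = 100, 20_000, 1.0
k = np.exp(-0.5 * tau**2)  # exact kernel value at separation τ

omega = rng.standard_normal((trials, D))
k_hat = np.mean(np.exp(1j * omega * tau), axis=1)  # one estimate per trial

empirical = np.mean(np.abs(k_hat - k) ** 2)
predicted = (1 - k**2) / D
print(empirical, predicted)  # should agree within a few percent
```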
Concentration for learning:
For supervised learning, what matters is not kernel approximation per se, but whether predictions are accurate. Tighter analyses show that generalization error degrades gracefully with kernel approximation error, with the effective sample complexity increasing by approximately $(1 + \epsilon)$ where $\epsilon$ is the kernel approximation error.
The basic RFF has been extended in numerous ways to improve approximation quality, reduce variance, and handle more general kernels:
1. Orthogonal Random Features (ORF):
Yu et al. (2016) showed that orthogonalizing the random frequency matrix $\Omega$ reduces variance. Instead of sampling independently:
Standard RFF: Ω = [ω_1, ..., ω_{D/2}]^T, each ω_i ~ p(ω)
Orthogonal RF: Ω = S Q, where Q is a random orthogonal matrix and S is a diagonal rescaling giving each row a χ-distributed norm (matching ||ω|| under p(ω))
ORF achieves the same expected value with lower variance, requiring fewer features for the same accuracy.
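A minimal sketch of the ORF construction (the `orf_frequencies` helper is illustrative, not a library API): orthogonalize Gaussian blocks via QR, then rescale rows to χ-distributed norms so each frequency is still marginally Gaussian:

```python
import numpy as np

def orf_frequencies(n_freq: int, d: int, rng) -> np.ndarray:
    """Orthogonal random frequencies for the SE kernel (unit length-scale)."""
    blocks, remaining = [], n_freq
    while remaining > 0:
        G = rng.standard_normal((d, d))
        Q, _ = np.linalg.qr(G)                      # random orthogonal d×d
        norms = np.sqrt(rng.chisquare(d, size=d))   # ||ω|| ~ χ(d), as for N(0, I)
        blocks.append((Q * norms[:, None])[:min(remaining, d)])
        remaining -= d
    return np.vstack(blocks)

rng = np.random.default_rng(0)
omega = orf_frequencies(n_freq=128, d=4, rng=rng)
print(omega.shape)  # (128, 4); rows within each block of 4 are exactly orthogonal
```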
2. Quadrature Fourier Features (QFF):
Mutny & Krause (2018) replace Monte Carlo sampling with deterministic quadrature rules. For product spectral densities:
$$p(\omega) = \prod_{j=1}^d p_j(\omega_j)$$
Quadrature nodes provide "optimal" samples that minimize integration error. QFF is especially effective in low-to-moderate dimensions (d ≤ 10).
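For intuition, here is a sketch of quadrature features for the 1-D SE kernel using Gauss-Hermite nodes in place of Monte Carlo samples; with the substitution $\omega = \sqrt{2}x/\ell$, the Bochner integral becomes a Gauss-Hermite integral:

```python
import numpy as np

# QFF sketch for the 1-D SE kernel:
#   k(τ) = E_{ω ~ N(0, ℓ⁻²)}[cos(ωτ)] ≈ (1/√π) Σ_i w_i cos(√2 x_i τ / ℓ)
# where (x_i, w_i) are Gauss-Hermite nodes and weights.
nodes, weights = np.polynomial.hermite.hermgauss(30)  # 30-point rule
ell = 1.0
omega = np.sqrt(2) * nodes / ell

tau = np.linspace(-3, 3, 61)
k_qff = (weights / np.sqrt(np.pi)) @ np.cos(np.outer(omega, tau))
k_exact = np.exp(-0.5 * tau**2 / ell**2)
print(np.max(np.abs(k_qff - k_exact)))  # tiny: far below typical Monte Carlo error
```

With only 30 deterministic nodes the error is negligible, whereas Monte Carlo sampling would need orders of magnitude more features; this advantage evaporates in high dimensions because tensor-product grids grow exponentially in d.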
| Method | Variance | Complexity | Best For |
|---|---|---|---|
| Standard RFF | O(1/D) | O(NDd) | General purpose, simple implementation |
| Orthogonal RF (ORF) | O(1/D), smaller constant | O(NDd + Dd²) | When variance reduction is critical |
| Quadrature FF (QFF) | Deterministic; error decays exponentially in D^{1/d} | O(NDd) | Low-dimensional inputs (d ≤ 5) |
| Structured ORF (SORF) | O(1/D), smaller constant | O(ND log d) | Large D with structured matrices |
3. Structured Orthogonal Random Features (SORF):
Building on Fastfood (Le et al., 2013), SORF uses structured matrices (Hadamard, diagonal) to implement the orthogonal projection efficiently:
$$\Phi(\mathbf{x}) = \sqrt{\frac{1}{D}} \cdot \left[\begin{matrix} \cos(\Omega \mathbf{x}) \\ \sin(\Omega \mathbf{x}) \end{matrix}\right], \quad \Omega = \frac{1}{\sigma}S H \mathbf{D}_{\pm}$$
where H is a Hadamard matrix, $\mathbf{D}_{\pm}$ is a random diagonal matrix of ±1 entries (written this way to avoid clashing with the feature count D), and S is a diagonal scaling. The matrix-vector product $\Omega \mathbf{x}$ can be computed in O(d log d) per block via the fast Walsh-Hadamard transform, giving total complexity O(ND log d) instead of O(NDd).
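The key computational trick is that multiplying by a Hadamard matrix never requires materializing it. The sketch below (the `fwht` helper is illustrative) checks a fast Walsh-Hadamard transform, applied after a random ±1 sign flip, against the explicit matrix product:

```python
import numpy as np

def fwht(x: np.ndarray) -> np.ndarray:
    """Fast Walsh-Hadamard transform (unnormalized), O(d log d)."""
    x = x.copy()
    h = 1
    while h < len(x):
        for i in range(0, len(x), 2 * h):
            a, b = x[i:i + h].copy(), x[i + h:i + 2 * h].copy()
            x[i:i + h], x[i + h:i + 2 * h] = a + b, a - b
        h *= 2
    return x

rng = np.random.default_rng(0)
d = 8                                    # must be a power of 2
signs = rng.choice([-1.0, 1.0], size=d)  # random ±1 diagonal
x = rng.standard_normal(d)

fast = fwht(signs * x)  # H applied without building the d×d matrix

# Explicit Hadamard matrix for comparison (Sylvester construction)
H = np.array([[1.0]])
while H.shape[0] < d:
    H = np.block([[H, H], [H, -H]])
explicit = H @ (signs * x)
print(np.allclose(fast, explicit))  # True
```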
4. Generalized Spectral Kernels:
For non-stationary or structured kernels, alternative feature constructions exist:
Recent work (Wilson & Adams, 2013; Samo & Roberts, 2015) learns the spectral density from data, combining the flexibility of non-parametric spectral mixture kernels with the efficiency of random features. This enables adapting to complex periodic patterns and multi-scale structure while maintaining computational tractability.
With an explicit feature map $\phi: \mathbb{R}^d \to \mathbb{R}^D$, Gaussian Process regression transforms into Bayesian linear regression:
Model: $$f(\mathbf{x}) = \phi(\mathbf{x})^\top \mathbf{w}, \quad \mathbf{w} \sim \mathcal{N}(\mathbf{0}, I_D)$$ $$y = f(\mathbf{x}) + \epsilon, \quad \epsilon \sim \mathcal{N}(0, \sigma_n^2)$$
The prior over functions induced by this model has covariance: $$\text{Cov}[f(\mathbf{x}), f(\mathbf{x}')] = \phi(\mathbf{x})^\top \phi(\mathbf{x}') = \hat{k}(\mathbf{x}, \mathbf{x}')$$
which approximates the target kernel.
Posterior inference:
Given data $(X, \mathbf{y})$ with feature matrix $\Phi = [\phi(\mathbf{x}_1), ..., \phi(\mathbf{x}_N)]^\top \in \mathbb{R}^{N \times D}$:
$$p(\mathbf{w}|\mathbf{y}) = \mathcal{N}(\mathbf{m}_w, S_w)$$ $$S_w = (\sigma_n^{-2}\Phi^\top\Phi + I_D)^{-1}$$ $$\mathbf{m}_w = \sigma_n^{-2}S_w\Phi^\top\mathbf{y}$$
Predictive distribution: $$p(f_*|\mathbf{y}) = \mathcal{N}(\phi_*^\top\mathbf{m}_w,\; \phi_*^\top S_w \phi_*)$$ where $\phi_* = \phi(\mathbf{x}_*)$.
Computational complexity:
| Operation | Exact GP | RFF-GP |
|---|---|---|
| Training | O(N³) | O(ND² + D³) |
| Prediction (mean) | O(N) | O(D) |
| Prediction (var) | O(N²) | O(D²) |
| Memory | O(N²) | O(ND + D²) |
For D << N, RFF-GP is dramatically faster. With D = 1000 and N = 100,000, training costs roughly ND² + D³ ≈ 10¹¹ operations versus N³ = 10¹⁵ for the exact GP, a reduction of about four orders of magnitude; memory drops from 80 GB for the full kernel matrix to under 1 GB for the feature matrix.
When to use RFF-GP vs Sparse GP:
RFF-GP is not a panacea. It only works for stationary kernels, it cannot adapt feature locations to data (unlike learned inducing points), and the approximation quality is uniform rather than data-adaptive. For highly non-uniform data or when predictive uncertainty near specific regions is critical, variational sparse GPs may be superior.
Learning kernel hyperparameters (length-scale, variance) in the RFF framework requires care because the random features depend on these parameters.
The challenge:
In standard RFF, frequencies $\omega_j$ are sampled from $p(\omega; \theta)$ where $\theta$ includes length-scales. Changing $\theta$ changes the spectral density, requiring new samples—but we want to optimize $\theta$ via gradient descent.
Solution 1: Reparameterization
For the SE kernel with length-scale $\ell$, sample base frequencies from $\mathcal{N}(0, I)$ and scale:
omega_base = np.random.randn(D//2, d) # Sample once
omega = omega_base / lengthscale # Scale by current hyperparameter
Now $\omega$ is differentiable with respect to $\ell$, enabling gradient-based optimization.
Solution 2: Implicit reparameterization
For more complex spectral densities, use the reparameterization trick: $$\omega = T(\epsilon; \theta), \quad \epsilon \sim p_0(\epsilon)$$
where $T$ is a differentiable transformation and $p_0$ is a fixed base distribution.
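A tiny sketch of this idea for the Matérn-1/2 spectral density (names here are illustrative): the base draws are sampled once and fixed, and the transformed frequencies remain differentiable in the length-scale:

```python
import torch

# Implicit reparameterization sketch for Matérn-1/2 frequencies.
# Base draws z ~ N(0, I) and u ~ χ²(1) are sampled ONCE and fixed;
# the transformation ω = z / sqrt(u) / ℓ is then differentiable in ℓ.
torch.manual_seed(0)
d, D_half = 2, 64
z = torch.randn(D_half, d)                              # fixed base normals
u = torch.distributions.Chi2(df=1).sample((D_half, 1))  # fixed base χ² draws

log_ell = torch.tensor(0.0, requires_grad=True)  # log length-scale
ell = torch.exp(log_ell)
omega = z / torch.sqrt(u) / ell                  # ω = T(ε; ℓ)

# Any loss built from ω now backpropagates to the length-scale
loss = torch.sum(omega ** 2)
loss.backward()
print(log_ell.grad)  # nonzero gradient
```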
```python
import numpy as np
import torch
import torch.nn as nn
from torch.optim import Adam


class LearnableRFFGP(nn.Module):
    """
    RFF-GP with learnable hyperparameters via reparameterization.

    Key insight: sample base frequencies ω_0 ~ N(0, I) once, then
    scale by the lengthscale: ω = ω_0 / ℓ.
    This makes the likelihood differentiable w.r.t. ℓ.
    """

    def __init__(self, d: int, D: int, seed: int = None):
        super().__init__()
        self.d = d
        self.D = D

        # Fixed base frequencies and phases (sampled once)
        rng = np.random.RandomState(seed)
        omega_base = rng.randn(D // 2, d)
        b = rng.uniform(0, 2 * np.pi, size=D // 2)
        self.register_buffer('omega_base',
                             torch.tensor(omega_base, dtype=torch.float64))
        self.register_buffer('b', torch.tensor(b, dtype=torch.float64))

        # Learnable hyperparameters (in log space)
        self.log_lengthscale = nn.Parameter(torch.zeros(d, dtype=torch.float64))
        self.log_variance = nn.Parameter(torch.tensor(0.0, dtype=torch.float64))
        self.log_noise = nn.Parameter(torch.tensor(-2.0, dtype=torch.float64))

    @property
    def lengthscales(self):
        return torch.exp(self.log_lengthscale)

    @property
    def variance(self):
        return torch.exp(self.log_variance)

    @property
    def noise_variance(self):
        return torch.exp(self.log_noise)

    def features(self, X: torch.Tensor) -> torch.Tensor:
        """Compute RFF features with the current hyperparameters."""
        # Scale frequencies by lengthscales (ARD)
        omega = self.omega_base / self.lengthscales  # (D/2, d)

        # Projections
        proj = X @ omega.T + self.b  # (N, D/2)

        # Features
        Phi = torch.zeros(X.shape[0], self.D, dtype=torch.float64)
        Phi[:, 0::2] = torch.cos(proj)
        Phi[:, 1::2] = torch.sin(proj)

        # Scale by sqrt(2σ²/D)
        Phi = Phi * torch.sqrt(2 * self.variance / self.D)
        return Phi

    def log_marginal_likelihood(self, X: torch.Tensor,
                                y: torch.Tensor) -> torch.Tensor:
        """
        Compute log p(y | X, θ) for hyperparameter optimization.

        Uses the Bayesian linear regression marginal likelihood:
            log p(y) = log N(y; 0, Φ Φ^T + σ²I)
        computed efficiently via the Woodbury identity.
        """
        Phi = self.features(X)  # (N, D)
        N, D = Phi.shape

        # Cholesky of A = Φ^T Φ + σ²I (D×D matrix)
        A = Phi.T @ Phi + self.noise_variance * torch.eye(D, dtype=torch.float64)
        L_A = torch.linalg.cholesky(A)

        # log|Φ Φ^T + σ²I| = log|A| + (N-D) log(σ²)   [determinant lemma]
        log_det_A = 2 * torch.sum(torch.log(torch.diag(L_A)))
        log_det = log_det_A + (N - D) * self.log_noise

        # y^T (Φ Φ^T + σ²I)^{-1} y via Woodbury:
        # = (y^T y - y^T Φ A^{-1} Φ^T y) / σ²
        y_Phi = y @ Phi  # (D,)
        A_inv_Phi_y = torch.linalg.solve(L_A.T, torch.linalg.solve(L_A, y_Phi))
        quad = (torch.dot(y, y)
                - torch.dot(y_Phi, A_inv_Phi_y)) / self.noise_variance

        lml = -0.5 * quad - 0.5 * log_det - 0.5 * N * np.log(2 * np.pi)
        return lml

    def fit(self, X: np.ndarray, y: np.ndarray,
            n_iterations: int = 200, lr: float = 0.05):
        """Optimize hyperparameters via the marginal likelihood."""
        X_t = torch.tensor(X, dtype=torch.float64)
        y_t = torch.tensor(y, dtype=torch.float64)

        optimizer = Adam(self.parameters(), lr=lr)

        for i in range(n_iterations):
            optimizer.zero_grad()
            lml = self.log_marginal_likelihood(X_t, y_t)
            (-lml).backward()  # Maximize LML
            optimizer.step()

            if (i + 1) % 50 == 0:
                ls = self.lengthscales.detach().numpy()
                print(f"Iter {i+1}: LML={lml.item():.2f}, "
                      f"lengthscales={ls[:min(3, len(ls))]}, "
                      f"noise={self.noise_variance.item():.4f}")
        return self

    def predict(self, X_train: torch.Tensor, y_train: torch.Tensor,
                X_test: torch.Tensor) -> tuple:
        """Compute the predictive distribution at test points."""
        with torch.no_grad():
            Phi_train = self.features(X_train)
            Phi_test = self.features(X_test)

            # Posterior over weights
            A = (Phi_train.T @ Phi_train / self.noise_variance
                 + torch.eye(self.D, dtype=torch.float64))
            L = torch.linalg.cholesky(A)
            m = torch.linalg.solve(
                L.T, torch.linalg.solve(L, Phi_train.T @ y_train)
            ) / self.noise_variance

            # Predictive moments
            mu = Phi_test @ m
            v = torch.linalg.solve(L, Phi_test.T)
            var = torch.sum(v**2, dim=0) + self.noise_variance

        return mu, var


# Example: learning hyperparameters
if __name__ == "__main__":
    np.random.seed(42)
    torch.manual_seed(42)

    # Generate data with known structure
    N = 1000
    d = 3
    true_lengthscales = np.array([0.5, 1.0, 2.0])  # Different per dimension
    X = np.random.randn(N, d)
    # Function with ARD structure: first dimension matters most
    y = (3 * np.sin(X[:, 0] / 0.5) + 0.5 * np.cos(X[:, 1] / 1.0)
         + 0.1 * np.random.randn(N))

    print("Learning Hyperparameters with RFF-GP")
    print("=" * 50)
    print(f"True lengthscales (approx): {true_lengthscales}")
    print()

    model = LearnableRFFGP(d=d, D=500, seed=42)
    model.fit(X, y, n_iterations=150, lr=0.05)

    print(f"Learned lengthscales: {model.lengthscales.detach().numpy()}")
    print(f"Learned noise var: {model.noise_variance.item():.4f}")
```

Both random features and sparse GPs provide O(NM²) or O(ND²) complexity, but they achieve this through fundamentally different mechanisms with distinct trade-offs:
| Aspect | Random Features (RFF) | Sparse GPs (VFE/SVGP) |
|---|---|---|
| Approximation target | Kernel function globally | GP posterior locally |
| Kernel types | Stationary only | Any positive-definite |
| Adaptivity | Uniform across input space | Concentrated near data |
| Hyperparameter learning | Via marginal likelihood | Via ELBO |
| Uncertainty | May underestimate in sparse regions | Better near inducing points |
| Implementation | Simpler (linear algebra) | More complex (Cholesky, etc.) |
| Prediction cost | O(D) per point | O(M) per point |
| Minibatch training | Straightforward | Requires variational formulation |
| Memory | O(D²) for posterior | O(M²) for K_uu inverse |
When to choose Random Features: the kernel is stationary, implementation simplicity and very fast O(D) prediction matter, or you want straightforward minibatch training without variational machinery.
When to choose Sparse GPs: the kernel is non-stationary, the data are highly non-uniform, or well-calibrated predictive uncertainty near specific regions is critical.
Recent work combines RFF with sparse GPs: use random features as the base kernel, then apply sparse methods on top. This provides the best of both worlds—global approximation quality from RFF and local adaptation from inducing points. Libraries like GPyTorch support these composable scalability approaches.
We have developed the random feature framework for scalable Gaussian Processes—an elegant approach that approximates the kernel itself rather than the GP structure. The key insights from this module's final page:
Module conclusion:
This module has comprehensively covered scalable Gaussian Process methods—from understanding why exact GPs don't scale, through sparse approximations and inducing point selection, to the principled variational framework and random feature alternatives. You now possess the full toolkit for deploying Gaussian Processes at scale:
With these techniques, Gaussian Processes transform from an elegant but impractical method into a production-ready tool for large-scale probabilistic machine learning.
Congratulations! You have completed the Scalable GPs module. You now understand the computational barriers of exact GP inference, the theory and practice of sparse approximations, variational inference for scalability, and random feature methods. These techniques power real-world GP deployments at companies like Uber, Google, and Amazon, handling predictions at scales that would have seemed impossible a decade ago.