The sparse GP methods we have explored—DTC, FITC, and their variants—represent pragmatic approximations that reduce computational complexity. But they carry an uncomfortable property: they lack principled justification as inference procedures. We introduced approximations that seemed reasonable, verified they were computationally tractable, and hoped the resulting predictions would be sensible.
Variational Sparse Gaussian Processes transform this landscape. By recasting sparse GP inference within the framework of variational inference, we obtain:

- A well-defined approximation target: the variational distribution is chosen to minimize the KL divergence to the true posterior.
- A single objective, the ELBO, that lower-bounds the log marginal likelihood and can be used for hyperparameter learning.
- A built-in penalty (the trace correction) for variance the inducing points fail to capture, guarding against overconfidence.
- A path to stochastic minibatch optimization, scaling GPs to millions of data points.
This variational perspective—initiated by Titsias (2009) and refined in subsequent work—represents the modern paradigm for scalable Gaussian Processes. Understanding this framework is essential for any serious practitioner.
This page develops the variational sparse GP framework from first principles, derives the key ELBO objective, and demonstrates how this enables scalable inference on datasets previously considered impossible for GP methods.
By the end of this page, you will understand how variational inference applies to sparse GPs, derive the ELBO objective step by step, see how DTC emerges as an optimal variational distribution, appreciate the crucial trace term correction, and implement stochastic variational inference for GPs that scales to millions of data points.
Recall the general variational inference framework. Given data $\mathbf{y}$ and latent variables $\mathbf{z}$, we want the posterior $p(\mathbf{z}|\mathbf{y})$. When this is intractable, we approximate it with a tractable distribution $q(\mathbf{z})$ by minimizing the KL divergence:
$$\text{KL}(q(\mathbf{z}) \| p(\mathbf{z}|\mathbf{y})) = \int q(\mathbf{z}) \log \frac{q(\mathbf{z})}{p(\mathbf{z}|\mathbf{y})} d\mathbf{z}$$
This is equivalent to maximizing the Evidence Lower Bound (ELBO):
$$\mathcal{L}(q) = \log p(\mathbf{y}) - \text{KL}(q(\mathbf{z}) \| p(\mathbf{z}|\mathbf{y})) = \mathbb{E}_{q(\mathbf{z})}[\log p(\mathbf{y}, \mathbf{z})] - \mathbb{E}_{q(\mathbf{z})}[\log q(\mathbf{z})]$$
Since KL divergence is non-negative, $\mathcal{L}(q) \leq \log p(\mathbf{y})$ with equality when $q = p(\cdot|\mathbf{y})$.
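To make the bound concrete, here is a minimal numerical sketch (not from the original text) using a one-dimensional conjugate Gaussian model, where both the exact log evidence and the ELBO are available in closed form; the variable names are illustrative only.

```python
import numpy as np

# Toy model: z ~ N(0, 1), y | z ~ N(z, sigma2). Exact evidence: p(y) = N(0, 1 + sigma2).
sigma2, y = 0.5, 1.3
log_evidence = -0.5 * np.log(2 * np.pi * (1 + sigma2)) - y**2 / (2 * (1 + sigma2))

def elbo(mu_q, s2_q):
    """Closed-form ELBO for q(z) = N(mu_q, s2_q) under the toy model."""
    e_loglik = -0.5 * np.log(2 * np.pi * sigma2) - ((y - mu_q)**2 + s2_q) / (2 * sigma2)
    e_logprior = -0.5 * np.log(2 * np.pi) - (mu_q**2 + s2_q) / 2
    entropy = 0.5 * np.log(2 * np.pi * np.e * s2_q)
    return e_loglik + e_logprior + entropy

# Exact posterior: N(y / (1 + sigma2), sigma2 / (1 + sigma2)) -> the bound is tight.
print(elbo(y / (1 + sigma2), sigma2 / (1 + sigma2)), log_evidence)  # equal
print(elbo(0.0, 1.0) <= log_evidence)                               # True for any q
```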
Application to sparse GPs:

In the sparse GP setting, the observations are $\mathbf{y}$, and the latent variables are the function values $\mathbf{f}$ at the $N$ training inputs together with the inducing values $\mathbf{u} = f(Z)$ at $M$ inducing locations $Z$.
The joint prior from the GP is:
$$p(\mathbf{f}, \mathbf{u}) = p(\mathbf{f}|\mathbf{u})p(\mathbf{u})$$
where $p(\mathbf{u}) = \mathcal{N}(\mathbf{0}, K_{uu})$ and $p(\mathbf{f}|\mathbf{u}) = \mathcal{N}(K_{fu}K_{uu}^{-1}\mathbf{u}, K_{ff} - K_{fu}K_{uu}^{-1}K_{uf})$.
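The conditional $p(\mathbf{f}|\mathbf{u})$ is just the standard Gaussian conditioning formula. A minimal NumPy sketch, with illustrative names and kernel matrices assumed to be precomputed:

```python
import numpy as np

def gp_conditional(Kff, Kfu, Kuu, u, jitter=1e-8):
    """Moments of p(f | u) = N(Kfu Kuu^{-1} u, Kff - Kfu Kuu^{-1} Kuf)."""
    L = np.linalg.cholesky(Kuu + jitter * np.eye(len(Kuu)))
    A = np.linalg.solve(L, Kfu.T)            # A = L^{-1} Kuf, so A.T @ A = Kfu Kuu^{-1} Kuf
    mean = Kfu @ np.linalg.solve(L.T, np.linalg.solve(L, u))
    cov = Kff - A.T @ A
    return mean, cov
```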
The variational approach treats the inducing values $\mathbf{u}$ as the primary latent variables. Rather than approximating the $N \times N$ kernel matrix, we approximate the posterior $p(\mathbf{u}|\mathbf{y})$. Given the posterior on $\mathbf{u}$, we can reconstruct the posterior on $\mathbf{f}$ via the GP conditional. This shifts the complexity from $N$ to $M$.
Following Titsias (2009), we choose a variational distribution over the joint $(\mathbf{f}, \mathbf{u})$ with specific structure:
$$q(\mathbf{f}, \mathbf{u}) = p(\mathbf{f}|\mathbf{u})q(\mathbf{u})$$
Critically, we retain the exact GP conditional $p(\mathbf{f}|\mathbf{u})$ from the prior. The approximation affects only the marginal over inducing variables, where we use:
$$q(\mathbf{u}) = \mathcal{N}(\mathbf{m}, S)$$
with variational parameters $\mathbf{m} \in \mathbb{R}^M$ (mean) and $S \in \mathbb{R}^{M \times M}$ (covariance, constrained to be positive definite).
Why this structure?
The key insight is that under the GP prior, $p(\mathbf{f}|\mathbf{u})$ is already available in closed form. By keeping this conditional exact, we:

- never have to parameterize an $N$-dimensional distribution: the only free quantities are the $M$-dimensional mean $\mathbf{m}$ and the $M \times M$ covariance $S$;
- make the conditionals cancel in the objective, so the divergence term reduces to $\text{KL}(q(\mathbf{u}) \| p(\mathbf{u}))$ over the $M$-dimensional inducing variables (see the ELBO below);
- keep the approximate posterior consistent with the GP prior, so predictions remain valid GP predictions.
The marginal variational distribution over $\mathbf{f}$ becomes:
$$q(\mathbf{f}) = \int p(\mathbf{f}|\mathbf{u})q(\mathbf{u})d\mathbf{u} = \mathcal{N}(K_{fu}K_{uu}^{-1}\mathbf{m}, K_{ff} - K_{fu}K_{uu}^{-1}(K_{uu} - S)K_{uu}^{-1}K_{uf})$$
Note that when $S = K_{uu}$ (and $\mathbf{m} = \mathbf{0}$), $q(\mathbf{f})$ reduces to the prior with variance $K_{ff}$. When $S \to 0$, so that $q(\mathbf{u})$ collapses to a point estimate, the variance reduces to the conditional variance $K_{ff} - K_{fu}K_{uu}^{-1}K_{uf}$.
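A short NumPy sketch of these marginal moments, with the two limiting cases checked numerically (names are illustrative; kernel matrices are built from a toy one-dimensional RBF kernel):

```python
import numpy as np

def q_f_moments(Kff, Kfu, Kuu, m, S, jitter=1e-8):
    """Moments of q(f) = ∫ p(f|u) q(u) du for q(u) = N(m, S)."""
    Kuu_inv = np.linalg.inv(Kuu + jitter * np.eye(len(Kuu)))   # fine for a small sketch
    A = Kfu @ Kuu_inv                                           # N x M
    mean = A @ m
    cov = Kff - A @ (Kuu - S) @ A.T
    return mean, cov

# Sanity checks on random inputs
rng = np.random.default_rng(0)
X, Z = rng.normal(size=(6, 1)), rng.normal(size=(3, 1))
k = lambda A, B: np.exp(-0.5 * (A - B.T) ** 2)
Kff, Kfu, Kuu = k(X, X), k(X, Z), k(Z, Z)
_, cov_prior = q_f_moments(Kff, Kfu, Kuu, np.zeros(3), Kuu)     # S = Kuu -> prior covariance
_, cov_point = q_f_moments(Kff, Kfu, Kuu, np.zeros(3), 0 * Kuu) # S -> 0  -> Kff - Qff
print(np.allclose(cov_prior, Kff))                              # True
Qff = Kfu @ np.linalg.solve(Kuu, Kfu.T)
print(np.abs(cov_point - (Kff - Qff)).max())                    # ~0 (up to jitter)
```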
Predictive distribution:
For a new test point $\mathbf{x}_*$, the variational posterior predictive is:
$$q(f_*) = \int p(f_*|\mathbf{u})q(\mathbf{u})d\mathbf{u} = \mathcal{N}(\mu_*, \sigma_*^2)$$
where:
$$\mu_* = \mathbf{k}_{*u}K_{uu}^{-1}\mathbf{m}$$ $$\sigma_*^2 = k_{**} - \mathbf{k}_{*u}K_{uu}^{-1}(K_{uu} - S)K_{uu}^{-1}\mathbf{k}_{u*}$$
The mean involves only the M-dimensional inducing representation, making predictions O(M²) per test point. The variance correctly incorporates both prior uncertainty and posterior compression through $S$.
DTC can be viewed as replacing the conditional $p(\mathbf{f}|\mathbf{u})$ with a deterministic (zero-variance) conditional, while FITC retains only its diagonal. The variational framework's power is that it keeps $p(\mathbf{f}|\mathbf{u})$ exact and optimizes over $q(\mathbf{u})$ rather than fixing the approximation heuristically.
We now derive the Evidence Lower Bound for the variational sparse GP. The complete ELBO is:
$$\mathcal{L} = \mathbb{E}_{q(\mathbf{f},\mathbf{u})}[\log p(\mathbf{y}|\mathbf{f})] - \text{KL}(q(\mathbf{u}) \| p(\mathbf{u}))$$
Step 1: Expand using the variational structure
$$\mathcal{L} = \mathbb{E}_{q(\mathbf{u})}\left[\mathbb{E}_{p(\mathbf{f}|\mathbf{u})}[\log p(\mathbf{y}|\mathbf{f})]\right] - \text{KL}(q(\mathbf{u}) \| p(\mathbf{u}))$$
The inner expectation is over the exact GP conditional, and the outer is over the variational distribution on $\mathbf{u}$.
Step 2: Evaluate the expected log-likelihood
For Gaussian likelihood $p(\mathbf{y}|\mathbf{f}) = \mathcal{N}(\mathbf{y}; \mathbf{f}, \sigma_n^2 I)$:
$$\log p(\mathbf{y}|\mathbf{f}) = -\frac{N}{2}\log(2\pi\sigma_n^2) - \frac{1}{2\sigma_n^2}\|\mathbf{y} - \mathbf{f}\|^2$$
The expectation over $q(\mathbf{f})$ gives:
$$\mathbb{E}_{q(\mathbf{f})}[\log p(\mathbf{y}|\mathbf{f})] = -\frac{N}{2}\log(2\pi\sigma_n^2) - \frac{1}{2\sigma_n^2}\left(\|\mathbf{y} - \mathbb{E}_{q}[\mathbf{f}]\|^2 + \text{tr}(\text{Var}_q[\mathbf{f}])\right)$$
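This follows from the standard bias-variance identity (added here as an intermediate step), applied coordinate-wise:

$$\mathbb{E}_q[(y_n - f_n)^2] = (y_n - \mathbb{E}_q[f_n])^2 + \text{Var}_q[f_n]$$

Summing over $n$ yields the squared norm plus the trace of the marginal variance above.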
Step 3: Substitute the variational moments
With $\mathbb{E}_{q}[\mathbf{f}] = K_{fu}K_{uu}^{-1}\mathbf{m}$ and $$\text{Var}_q[\mathbf{f}] = K_{ff} - K_{fu}K_{uu}^{-1}(K_{uu} - S)K_{uu}^{-1}K_{uf}$$
The expected log-likelihood becomes:
$$\mathbb{E}_{q}[\log p(\mathbf{y}|\mathbf{f})] = \log \mathcal{N}(\mathbf{y}; K_{fu}K_{uu}^{-1}\mathbf{m}, \sigma_n^2 I) - \frac{1}{2\sigma_n^2}\text{tr}(\tilde{K})$$
where $\tilde{K} = K_{ff} - K_{fu}K_{uu}^{-1}(K_{uu} - S)K_{uu}^{-1}K_{uf} = K_{ff} - Q_{ff} + K_{fu}K_{uu}^{-1}SK_{uu}^{-1}K_{uf}$.
Step 4: The complete ELBO
Combining with the KL divergence between Gaussians:
$$\text{KL}(q(\mathbf{u}) \| p(\mathbf{u})) = \frac{1}{2}\left(\text{tr}(K_{uu}^{-1}S) + \mathbf{m}^\top K_{uu}^{-1}\mathbf{m} - M + \log\frac{|K_{uu}|}{|S|}\right)$$
The full ELBO is:
$$\mathcal{L} = \log \mathcal{N}(\mathbf{y}; K_{fu}K_{uu}^{-1}\mathbf{m}, \sigma_n^2 I) - \frac{\text{tr}(K_{ff} - Q_{ff})}{2\sigma_n^2} - \frac{\text{tr}(K_{fu}K_{uu}^{-1}SK_{uu}^{-1}K_{uf})}{2\sigma_n^2} - \text{KL}(q \| p)$$
The term $\text{tr}(K_{ff} - Q_{ff})/(2\sigma_n^2)$ is called the trace correction or approximation penalty. It measures how much variance the inducing points fail to capture and penalizes the ELBO accordingly. This is what distinguishes variational sparse GPs from DTC: DTC ignores this term, leading to overconfident predictions. The trace term provides principled regularization.
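A small NumPy sketch of the two penalty terms just derived, the Gaussian KL computed through Cholesky factors and the trace correction; helper names are illustrative and kernel matrices are assumed given.

```python
import numpy as np

def kl_gauss_to_prior(m, S, Kuu):
    """KL( N(m, S) || N(0, Kuu) ) using Cholesky factors for stability."""
    M = len(m)
    L = np.linalg.cholesky(Kuu)
    Ls = np.linalg.cholesky(S)
    alpha = np.linalg.solve(L, m)                 # L^{-1} m
    A = np.linalg.solve(L, Ls)                    # L^{-1} L_S
    trace = np.sum(A**2)                          # tr(Kuu^{-1} S)
    quad = alpha @ alpha                          # m^T Kuu^{-1} m
    logdet = 2 * np.sum(np.log(np.diag(L))) - 2 * np.sum(np.log(np.diag(Ls)))
    return 0.5 * (trace + quad - M + logdet)

def trace_correction(Kff_diag, Kfu, Kuu, noise_var):
    """tr(Kff - Qff) / (2 sigma_n^2): the variance the inducing points miss."""
    L = np.linalg.cholesky(Kuu)
    A = np.linalg.solve(L, Kfu.T)                 # L^{-1} Kuf
    Qff_diag = np.sum(A**2, axis=0)
    return np.sum(Kff_diag - Qff_diag) / (2 * noise_var)
```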
A remarkable property of the variational sparse GP ELBO is that the optimal $q(\mathbf{u})$ can be computed in closed form. Taking derivatives and setting to zero:
Optimal covariance:
$$S^* = (K_{uu}^{-1} + \sigma_n^{-2}K_{uu}^{-1}K_{uf}K_{fu}K_{uu}^{-1})^{-1}$$
Using the Woodbury identity:
$$S^* = K_{uu}(K_{uu} + \sigma_n^{-2}K_{uf}K_{fu})^{-1}K_{uu}$$
Optimal mean:
$$\mathbf{m}^* = \sigma_n^{-2}S^*K_{uu}^{-1}K_{uf}\mathbf{y}$$
Simplifying:
$$\mathbf{m}^* = \sigma_n^{-2}K_{uu}(K_{uu} + \sigma_n^{-2}K_{uf}K_{fu})^{-1}K_{uf}\mathbf{y}$$
Interpretation:
These optimal parameters have a clean interpretation. The posterior covariance $S^*$ is exactly what you would get by running GP inference on the inducing points themselves, treating $K_{fu}K_{uu}^{-1}\mathbf{u}$ as a linear observation model. The posterior mean $\mathbf{m}^*$ is the corresponding posterior mean under this model.
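The optimal parameters are only a few lines of linear algebra. A sketch under the same assumptions as the earlier snippets (precomputed kernel matrices, illustrative names):

```python
import numpy as np

def optimal_variational_params(Kuf, Kuu, y, noise_var):
    """Closed-form optimum of q(u) = N(m*, S*) for a Gaussian likelihood (Titsias, 2009)."""
    Sigma = Kuu + Kuf @ Kuf.T / noise_var          # Kuu + sigma_n^{-2} Kuf Kfu
    Sigma_inv = np.linalg.inv(Sigma)               # fine for a small M x M sketch
    S_star = Kuu @ Sigma_inv @ Kuu
    m_star = Kuu @ Sigma_inv @ (Kuf @ y) / noise_var
    return m_star, S_star
```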
Plugging these optimal values back into the ELBO gives the collapsed ELBO (also called the VFE bound after Titsias):
$$\mathcal{L}^* = \log \mathcal{N}(\mathbf{y}; \mathbf{0}, Q_{ff} + \sigma_n^2 I) - \frac{\text{tr}(K_{ff} - Q_{ff})}{2\sigma_n^2}$$
Computing the collapsed ELBO requires: (1) O(NM²) for matrix products involving K_{fu}, (2) O(M³) for Cholesky of the M×M matrix, (3) O(N) for the trace of K_{ff} (just the diagonal sum). The total is O(NM² + M³), matching classical sparse methods but with a principled objective.
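As a concrete illustration of that cost breakdown, here is a minimal NumPy sketch of the collapsed bound $\mathcal{L}^*$ that never forms an $N \times N$ matrix; names are illustrative and a Gaussian likelihood is assumed.

```python
import numpy as np

def collapsed_elbo(y, Kff_diag, Kfu, Kuu, noise_var, jitter=1e-8):
    """Titsias bound: log N(y; 0, Qff + sig2 I) - tr(Kff - Qff)/(2 sig2), in O(N M^2)."""
    N, M = Kfu.shape
    L = np.linalg.cholesky(Kuu + jitter * np.eye(M))
    A = np.linalg.solve(L, Kfu.T)                  # M x N, so Qff = A.T @ A
    B = np.eye(M) + A @ A.T / noise_var            # I + A A^T / sig2
    LB = np.linalg.cholesky(B)
    c = np.linalg.solve(LB, A @ y) / noise_var
    # log|Qff + sig2 I| via the matrix determinant lemma
    logdet = N * np.log(noise_var) + 2 * np.sum(np.log(np.diag(LB)))
    # y^T (Qff + sig2 I)^{-1} y via the Woodbury identity
    quad = y @ y / noise_var - c @ c
    log_marg = -0.5 * (N * np.log(2 * np.pi) + logdet + quad)
    trace = np.sum(Kff_diag - np.sum(A**2, axis=0)) / (2 * noise_var)
    return log_marg - trace
```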
The variational framework enables a crucial capability: stochastic optimization with minibatches. The collapsed bound couples all data points through the $N \times N$ matrix $Q_{ff} + \sigma_n^2 I$, but the uncollapsed ELBO, with $q(\mathbf{u}) = \mathcal{N}(\mathbf{m}, S)$ kept explicit (Hensman et al., 2013), decomposes as a sum over data points:
$$\mathcal{L} = \sum_{n=1}^N \mathcal{L}_n - \text{KL}(q(\mathbf{u}) \| p(\mathbf{u}))$$
where $\mathcal{L}_n$ depends only on the n-th datapoint $(\mathbf{x}_n, y_n)$ and the global parameters $(\mathbf{m}, S, Z, \theta)$.
This additive structure allows us to estimate the ELBO using random minibatches:
$$\mathcal{L} \approx \frac{N}{B}\sum_{n \in \mathcal{B}} \mathcal{L}_n - \text{KL}(q(\mathbf{u}) \| p(\mathbf{u}))$$
where $\mathcal{B}$ is a minibatch of $B \ll N$ datapoints.
Complexity per iteration:
With minibatches, each gradient step requires:

- $O(BM^2)$ for the cross-covariance terms $K_{bu}$ and the variational moments on the batch,
- $O(M^3)$ for the Cholesky factorization of $K_{uu}$ and the KL term,
- $O(BM)$ for the per-point expected log-likelihoods.

For B = 256 and M = 1000, the batch term is roughly $2.6 \times 10^8$ operations and the $M^3$ Cholesky adds about $10^9$ more, both tractable on a single GPU in milliseconds. Compare this to the exact GP's $O(N^3)$: for N = 1,000,000, exact inference is completely intractable, while stochastic sparse GP training requires only hours of computation.
The SVGP algorithm:
```
Algorithm: Stochastic Variational Gaussian Process (SVGP)

Input:  Data {(x_n, y_n)}, inducing points Z, batch size B, learning rate η
Output: Variational parameters m, S; hyperparameters θ

1. Initialize m = 0, S = K_uu, θ to reasonable values
2. For t = 1 to T:
   a. Sample minibatch B ⊂ {1, ..., N}
   b. Compute stochastic ELBO gradient:
      ∇ ≈ (N/B) Σ_{n∈B} ∇L_n - ∇KL(q||p)
   c. Update parameters via gradient ascent:
      (m, S, Z, θ) ← (m, S, Z, θ) + η · ∇
3. Return optimal parameters
```
Natural gradients:
For the variational posterior $q(\mathbf{u})$, natural gradients can dramatically accelerate convergence. The natural gradient pre-conditions the Euclidean gradient by the inverse Fisher information matrix, which for Gaussians has closed form. Natural gradients for $(\mathbf{m}, S)$ often converge in 10× fewer iterations.
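As a brief aside (added for concreteness, and relying on the standard exponential-family result rather than anything specific to this page): for a Gaussian $q(\mathbf{u})$, the natural gradient of the ELBO with respect to the natural parameters equals the ordinary gradient with respect to the expectation parameters, so a natural-gradient step with step size $\rho$ can be written without any explicit Fisher matrix inversion:

$$\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} + \rho \,\frac{\partial \mathcal{L}}{\partial \boldsymbol{\eta}}, \qquad \boldsymbol{\theta} = \left(S^{-1}\mathbf{m},\ -\tfrac{1}{2}S^{-1}\right), \qquad \boldsymbol{\eta} = \left(\mathbf{m},\ S + \mathbf{m}\mathbf{m}^\top\right)$$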
With SVGP, Gaussian Processes become applicable to datasets with millions of points. Training proceeds analogously to neural network training: shuffle data, iterate through minibatches, update parameters via gradient descent. A well-tuned SVGP on 1M points with M=1000 inducing points can train in under an hour on a modern GPU.
```python
import numpy as np
import torch
import torch.nn as nn
from torch.optim import Adam
from scipy.cluster.vq import kmeans2


class SVGP(nn.Module):
    """
    Stochastic Variational Gaussian Process (SVGP).

    Implements the Titsias (2009) / Hensman et al. (2013) variational
    framework with minibatch training for scalability to large datasets.
    """

    def __init__(self, X_train: np.ndarray, M: int = 100, jitter: float = 1e-4):
        super().__init__()
        N, d = X_train.shape
        self.N = N
        self.M = M
        self.d = d
        self.jitter = jitter

        # Initialize inducing points via k-means
        Z_init, _ = kmeans2(X_train, M, minit='++')

        # Learnable inducing point locations
        self.Z = nn.Parameter(torch.tensor(Z_init, dtype=torch.float64))

        # Variational parameters
        # Mean: m ∈ R^M
        self.m = nn.Parameter(torch.zeros(M, dtype=torch.float64))
        # Covariance: S = L_S @ L_S^T (Cholesky parameterization ensures PSD)
        self.L_S_raw = nn.Parameter(torch.eye(M, dtype=torch.float64))

        # Kernel hyperparameters (in log space for positivity)
        self.log_lengthscale = nn.Parameter(torch.zeros(d, dtype=torch.float64))
        self.log_variance = nn.Parameter(torch.tensor(0.0, dtype=torch.float64))
        self.log_noise = nn.Parameter(torch.tensor(-2.0, dtype=torch.float64))

    @property
    def S(self):
        """Covariance matrix from Cholesky factor."""
        L = torch.tril(self.L_S_raw)
        return L @ L.T

    def kernel(self, X1, X2):
        """ARD squared exponential kernel."""
        lengthscales = torch.exp(self.log_lengthscale)
        variance = torch.exp(self.log_variance)
        # Scaled squared distances
        X1_scaled = X1 / lengthscales
        X2_scaled = X2 / lengthscales
        dist_sq = (torch.sum(X1_scaled**2, dim=1, keepdim=True)
                   + torch.sum(X2_scaled**2, dim=1)
                   - 2 * X1_scaled @ X2_scaled.T)
        return variance * torch.exp(-0.5 * dist_sq)

    def elbo(self, X_batch: torch.Tensor, y_batch: torch.Tensor) -> torch.Tensor:
        """
        Compute the stochastic ELBO estimate for a minibatch.

        Args:
            X_batch: (B, d) minibatch inputs
            y_batch: (B,) minibatch targets

        Returns:
            Stochastic ELBO estimate
        """
        B = len(X_batch)
        M = self.M
        noise_var = torch.exp(self.log_noise)

        # Kernel matrices
        Kuu = self.kernel(self.Z, self.Z) + self.jitter * torch.eye(M, dtype=torch.float64)
        Kbu = self.kernel(X_batch, self.Z)  # B x M
        Kbb_diag = torch.exp(self.log_variance) * torch.ones(B, dtype=torch.float64)  # diagonal only

        # Cholesky of Kuu
        L_uu = torch.linalg.cholesky(Kuu)

        # alpha = L_uu^{-1} K_ub (M x B); diag(Q_bb) = column sums of alpha^2
        alpha = torch.linalg.solve_triangular(L_uu, Kbu.T, upper=False)
        Qbb_diag = torch.sum(alpha**2, dim=0)

        # Variational mean at batch points: mu_b = K_bu K_uu^{-1} m
        Kuu_inv_m = torch.cholesky_solve(self.m.unsqueeze(-1), L_uu).squeeze(-1)
        mu_batch = Kbu @ Kuu_inv_m

        # Variational variance at batch points:
        #   Var_q[f_n] = k_nn - Q_nn + k_nu K_uu^{-1} S K_uu^{-1} k_un
        # With C = K_uu^{-1} K_ub and S = L_S L_S^T, the S-term is the column sums of (L_S^T C)^2
        C = torch.cholesky_solve(Kbu.T, L_uu)  # M x B
        L_S = torch.tril(self.L_S_raw)
        var_S_part = torch.sum((L_S.T @ C)**2, dim=0)
        var_batch = Kbb_diag - Qbb_diag + var_S_part

        # Expected log-likelihood per point:
        #   E_q[log p(y_n|f_n)] = log N(y_n; mu_n, sigma^2) - Var_q[f_n] / (2 sigma^2)
        # Note: the trace correction tr(K_bb - Q_bb) is already contained in var_batch,
        # so it must not be subtracted a second time.
        sq_error = (y_batch - mu_batch)**2
        ell = (-0.5 * torch.log(2 * np.pi * noise_var)
               - sq_error / (2 * noise_var)
               - var_batch / (2 * noise_var))

        # Scale by N/B for an unbiased stochastic estimate of the full-data sum
        expected_ll = (self.N / B) * torch.sum(ell)

        # KL divergence KL(q(u) || p(u))
        #   = 0.5 * (tr(K_uu^{-1} S) + m^T K_uu^{-1} m - M + log|K_uu| - log|S|)
        kl_trace = torch.sum(torch.linalg.solve_triangular(L_uu, L_S, upper=False)**2)
        kl_quad = torch.dot(Kuu_inv_m, self.m)
        kl_logdet = (2 * torch.sum(torch.log(torch.diag(L_uu)))
                     - 2 * torch.sum(torch.log(torch.abs(torch.diag(L_S)))))
        kl = 0.5 * (kl_trace + kl_quad - M + kl_logdet)

        # ELBO = E_q[log p(y|f)] - KL
        return expected_ll - kl

    def predict(self, X_star: torch.Tensor):
        """Predictive mean and variance at test points."""
        with torch.no_grad():
            Ksu = self.kernel(X_star, self.Z)
            Kss_diag = torch.exp(self.log_variance) * torch.ones(len(X_star), dtype=torch.float64)
            Kuu = self.kernel(self.Z, self.Z) + self.jitter * torch.eye(self.M, dtype=torch.float64)
            L_uu = torch.linalg.cholesky(Kuu)

            # Mean: K_*u K_uu^{-1} m
            Kuu_inv_m = torch.cholesky_solve(self.m.unsqueeze(-1), L_uu).squeeze(-1)
            mu = Ksu @ Kuu_inv_m

            # Variance (incorporating S)
            alpha = torch.linalg.solve_triangular(L_uu, Ksu.T, upper=False)
            Qss_diag = torch.sum(alpha**2, dim=0)
            C = torch.cholesky_solve(Ksu.T, L_uu)
            L_S = torch.tril(self.L_S_raw)
            var_S = torch.sum((L_S.T @ C)**2, dim=0)
            var = Kss_diag - Qss_diag + var_S

            return mu, torch.clamp(var, min=1e-6)


def train_svgp(X, y, M=100, batch_size=256, n_epochs=50, lr=0.01):
    """Train SVGP with minibatch optimization."""
    model = SVGP(X, M=M)
    X_tensor = torch.tensor(X, dtype=torch.float64)
    y_tensor = torch.tensor(y, dtype=torch.float64)

    optimizer = Adam(model.parameters(), lr=lr)
    N = len(X)

    print(f"Training SVGP: N={N}, M={M}, batch_size={batch_size}")
    print("=" * 50)

    for epoch in range(n_epochs):
        # Shuffle data
        perm = torch.randperm(N)
        epoch_elbo = 0.0
        n_batches = 0

        for i in range(0, N, batch_size):
            idx = perm[i:i + batch_size]
            X_batch = X_tensor[idx]
            y_batch = y_tensor[idx]

            optimizer.zero_grad()
            elbo = model.elbo(X_batch, y_batch)
            (-elbo).backward()  # Minimize negative ELBO
            optimizer.step()

            epoch_elbo += elbo.item()
            n_batches += 1

        avg_elbo = epoch_elbo / n_batches
        if (epoch + 1) % 10 == 0:
            print(f"Epoch {epoch+1}: Avg ELBO = {avg_elbo:.2f}")

    return model


# Example usage
if __name__ == "__main__":
    np.random.seed(42)
    torch.manual_seed(42)

    # Generate larger dataset to demonstrate scalability
    N = 10000
    X = np.random.randn(N, 2) * 3
    y = np.sin(X[:, 0]) + 0.5 * np.cos(0.5 * X[:, 1]) + 0.1 * np.random.randn(N)

    model = train_svgp(X, y, M=100, batch_size=256, n_epochs=30)

    # Make predictions
    X_test = np.random.randn(100, 2)
    mu, var = model.predict(torch.tensor(X_test, dtype=torch.float64))
    print(f"\nPredictions: mean in [{mu.min():.2f}, {mu.max():.2f}]")
    print(f"Uncertainty: std in [{var.sqrt().min():.3f}, {var.sqrt().max():.3f}]")
```

The variational sparse GP framework has been extended in numerous directions since its introduction. Key innovations include:
1. Inter-domain Variational GPs (Lázaro-Gredilla & Figueiras-Vidal, 2009):
Inducing points need not be in the same domain as training points. "Inter-domain" inducing variables can represent integrals, derivatives, or Fourier coefficients of the function. This enables more expressive representations with fewer inducing points.
2. Deep Gaussian Processes (Damianou & Lawrence, 2013):
Stack multiple GP layers, where the output of one GP becomes the input to the next. Variational sparse GPs make deep GPs computationally tractable:
$$\mathbf{f}^{(1)} \sim \mathcal{GP}(0, k^{(1)}(\mathbf{x}, \mathbf{x}'))$$ $$\mathbf{f}^{(2)} \sim \mathcal{GP}(0, k^{(2)}(\mathbf{f}^{(1)}, \mathbf{f}^{(1)'}))$$
Each layer uses its own inducing points and variational posterior.
3. Multi-output GPs (Alvarez et al., 2010):
Extend to vector-valued outputs with correlated components. Sparse approximations make multi-output GPs practical for multi-task and multi-fidelity learning.
4. Non-Gaussian likelihoods:
The variational framework extends beyond Gaussian noise to classification, count data, and other likelihoods. The expected log-likelihood no longer has a closed form and must itself be approximated, typically with Gauss-Hermite quadrature or Monte Carlo samples from the Gaussian marginals $q(f_n)$; a quadrature sketch follows below.
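A minimal sketch (not part of the original text) of one-dimensional Gauss-Hermite quadrature for $\mathbb{E}_{q(f_n)}[\log p(y_n | f_n)]$, here with a Bernoulli-probit likelihood chosen purely as an example:

```python
import numpy as np
from scipy.stats import norm

def expected_log_lik_gh(mu, var, y, n_points=20):
    """E_{N(f; mu, var)}[log p(y | f)] for a probit Bernoulli likelihood, via Gauss-Hermite."""
    x, w = np.polynomial.hermite.hermgauss(n_points)   # nodes/weights for ∫ e^{-x^2} g(x) dx
    f = mu + np.sqrt(2.0 * var) * x                     # change of variables to N(mu, var)
    log_lik = norm.logcdf(y * f)                        # y ∈ {-1, +1}
    return np.sum(w * log_lik) / np.sqrt(np.pi)

# Example: a single latent marginal q(f_n) = N(0.3, 0.5) with label y_n = +1
print(expected_log_lik_gh(0.3, 0.5, +1))
```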
5. Orthogonally Decoupled Variational GPs (Salimbeni et al., 2018):
Parameterize the variational posterior more flexibly using orthogonal bases. This improves expressiveness while maintaining tractability.
6. Sparse Variational GPs with Streaming Data (Bui et al., 2017):
Extend to online settings where data arrives sequentially. The variational posterior is updated incrementally without storing all historical data.
Modern GP libraries implement variational sparse GPs as their primary scalable method: GPyTorch (PyTorch), GPflow (TensorFlow), and Pyro (PyTorch probabilistic programming) all provide production-quality SVGP implementations with GPU support, natural gradients, and automatic differentiation for hyperparameter learning.
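For orientation, here is a compact sketch of how an SVGP model is typically declared in GPyTorch. It follows GPyTorch's documented variational API as I understand it (a recent library version is assumed); it is not code from this page, so verify the class names against the library docs before relying on it.

```python
import torch
import gpytorch

class SVGPModel(gpytorch.models.ApproximateGP):
    def __init__(self, inducing_points):
        # q(u) = N(m, S) with a full (Cholesky-parameterized) covariance
        q_u = gpytorch.variational.CholeskyVariationalDistribution(inducing_points.size(0))
        # The variational strategy handles p(f|u) and the KL term internally
        strategy = gpytorch.variational.VariationalStrategy(
            self, inducing_points, q_u, learn_inducing_locations=True
        )
        super().__init__(strategy)
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel())

    def forward(self, x):
        return gpytorch.distributions.MultivariateNormal(self.mean_module(x), self.covar_module(x))

# Training objective: the (uncollapsed) ELBO, scaled for minibatches via num_data
# likelihood = gpytorch.likelihoods.GaussianLikelihood()
# mll = gpytorch.mlls.VariationalELBO(likelihood, model, num_data=N)
```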
Deploying stochastic variational GPs in practice requires attention to several details that can significantly impact performance:
| Parameter | Typical Range | Guidance |
|---|---|---|
| Inducing points (M) | 100-2000 | Start with √N; increase if ELBO plateaus with more iterations |
| Batch size (B) | 64-1024 | Larger is more stable; smaller allows faster iterations |
| Learning rate | 0.001-0.1 | Use Adam; reduce if unstable, increase if slow |
| Epochs | 50-500 | Monitor validation ELBO for convergence |
| Natural gradients | On/Off | Enable for m, S learning; often 2-10× faster |
SVGP can overfit the ELBO if trained too long, especially with learned inducing points. The noise variance σ² may shrink to near-zero, the trace term may dominate, and predictions become overconfident. Use validation-based early stopping and regularize the noise variance lower bound.
We have developed the variational sparse GP framework, the foundation of modern scalable Gaussian Process methods. The key insights from this page:

- The variational family $q(\mathbf{f}, \mathbf{u}) = p(\mathbf{f}|\mathbf{u})q(\mathbf{u})$ keeps the GP conditional exact and approximates only the $M$-dimensional distribution over inducing values.
- The ELBO lower-bounds the log marginal likelihood and includes a trace correction that penalizes variance the inducing points fail to capture.
- For Gaussian likelihoods the optimal $q(\mathbf{u})$ is available in closed form, yielding the collapsed Titsias bound at $O(NM^2 + M^3)$ cost.
- The uncollapsed ELBO decomposes over data points, enabling stochastic minibatch training (SVGP) that scales to millions of observations.
What's next:
The final page in this module explores an orthogonal approach to GP scalability: random feature approximations. Rather than compressing through inducing points, random features approximate the kernel itself via finite-dimensional feature maps. This provides different trade-offs and is particularly effective for certain kernel families. Together with variational sparse GPs, these methods cover the full spectrum of scalable GP techniques.
You now understand the variational sparse GP framework that powers modern scalable GP implementations. The ELBO objective, stochastic optimization, and principled uncertainty propagation enable Gaussian Processes at scales that were previously unthinkable—from domains like computer vision and natural language processing to industrial applications with millions of observations.