Gaussian Processes (GPs) are among the most powerful tools in probabilistic machine learning. They provide not just predictions, but complete predictive distributions—means and variances that quantify epistemic uncertainty. This makes GPs invaluable for applications where knowing "how sure are we?" is as important as the prediction itself: active learning, Bayesian optimization, safety-critical systems, and more.
But GPs have a problem: they don't scale. The exact GP posterior requires $O(n^3)$ computation and $O(n^2)$ memory—precisely the limitations we've been discussing throughout this module.
Sparse Gaussian Processes solve this by introducing inducing points—a small set of $m$ pseudo-observations that summarize the entire dataset. Unlike Nyström (which also uses representative points), sparse GPs work within a principled Bayesian framework, optimizing the inducing point locations and values through variational inference.
The result is a GP that scales to millions of points while retaining full uncertainty quantification.
By the end of this page, you will understand the inducing point framework and why it enables scalability, the key sparse GP approximations (FITC, VFE, SVGP), how variational inference enables minibatch training, inducing point optimization strategies, and the relationship between sparse GPs and other kernel approximations.
The core idea of sparse GPs is to introduce inducing variables—function values at a set of $m$ inducing locations $\mathbf{Z} = \{\mathbf{z}_1, \ldots, \mathbf{z}_m\}$ (also called inducing inputs or pseudo-inputs).
The setup:
Let $\mathbf{f} = (f(\mathbf{x}_1), \ldots, f(\mathbf{x}_n))^T$ be the function values at training inputs, and $\mathbf{u} = (f(\mathbf{z}_1), \ldots, f(\mathbf{z}_m))^T$ be the function values at inducing locations.
Under the GP prior:
$$\begin{bmatrix} \mathbf{f} \\ \mathbf{u} \end{bmatrix} \sim \mathcal{N}\left( \mathbf{0}, \begin{bmatrix} \mathbf{K}_{nn} & \mathbf{K}_{nm} \\ \mathbf{K}_{mn} & \mathbf{K}_{mm} \end{bmatrix} \right)$$
where $\mathbf{K}_{nn}$, $\mathbf{K}_{nm} = \mathbf{K}_{mn}^T$, and $\mathbf{K}_{mm}$ are the kernel matrices among training points, between training and inducing points, and among inducing points, respectively.

If we can express the training function values $\mathbf{f}$ in terms of the inducing values $\mathbf{u}$, then inference involves only the $m \times m$ matrix $\mathbf{K}_{mm}$ instead of the $n \times n$ matrix $\mathbf{K}_{nn}$. The challenge is doing this approximation in a principled way that preserves GP properties.
The conditional distribution:
From standard Gaussian conditioning, the distribution of $\mathbf{f}$ given $\mathbf{u}$ is:
$$p(\mathbf{f} | \mathbf{u}) = \mathcal{N}\big(\mathbf{K}_{nm}\mathbf{K}_{mm}^{-1}\mathbf{u},\; \mathbf{K}_{nn} - \mathbf{K}_{nm}\mathbf{K}_{mm}^{-1}\mathbf{K}_{mn}\big)$$
Define the conditional mean $\boldsymbol{\mu}_f = \mathbf{K}_{nm}\mathbf{K}_{mm}^{-1}\mathbf{u}$ and the residual covariance $\boldsymbol{\Sigma}_f = \mathbf{K}_{nn} - \mathbf{K}_{nm}\mathbf{K}_{mm}^{-1}\mathbf{K}_{mn} = \mathbf{K}_{nn} - \mathbf{Q}_{nn}$; both appear repeatedly in the approximations below.
The graphical model:
Sparse GPs treat $\mathbf{u}$ as a compact summary of the function: the joint prior factorizes as

$$p(\mathbf{f}, \mathbf{u}) = p(\mathbf{f} | \mathbf{u})\, p(\mathbf{u})$$

and the approximations below constrain or replace $p(\mathbf{f} | \mathbf{u})$ so that all information about the training data flows through the $m$ inducing variables. The notation used throughout:
| Symbol | Size | Meaning |
|---|---|---|
| $\mathbf{Z}$ | $m \times d$ | Inducing input locations (learnable) |
| $\mathbf{u}$ | $m \times 1$ | Function values at inducing locations |
| $\mathbf{K}_{mm}$ | $m \times m$ | Kernel matrix among inducing points |
| $\mathbf{K}_{nm}$ | $n \times m$ | Kernel between training and inducing |
| $\mathbf{Q}_{nn}$ | $n \times n$ | Low-rank approximation: $\mathbf{K}_{nm}\mathbf{K}_{mm}^{-1}\mathbf{K}_{mn}$ |
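To ground the notation, here is a minimal NumPy sketch (toy data and a hypothetical `rbf_kernel` helper, not a library API) that builds these matrices together with the conditional mean and residual covariance defined above.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0, variance=1.0):
    """RBF kernel matrix between rows of A and rows of B."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * sq_dists / lengthscale**2)

rng = np.random.default_rng(0)
n, m, d = 200, 15, 2
X = rng.normal(size=(n, d))                 # training inputs
Z = X[rng.choice(n, m, replace=False)]      # inducing inputs (here: a random subset)
u = rng.normal(size=m)                      # some inducing function values

K_nn = rbf_kernel(X, X)
K_nm = rbf_kernel(X, Z)                             # n x m cross-covariance
K_mm = rbf_kernel(Z, Z) + 1e-6 * np.eye(m)          # jitter for numerical stability

# Low-rank approximation Q_nn = K_nm K_mm^{-1} K_mn
K_mm_inv_K_mn = np.linalg.solve(K_mm, K_nm.T)
Q_nn = K_nm @ K_mm_inv_K_mn

# Conditional p(f | u): mean and residual covariance
mu_f = K_nm @ np.linalg.solve(K_mm, u)
Sigma_f = K_nn - Q_nn                       # what the sparse approximations discard
print(mu_f.shape, Sigma_f.shape)            # (200,), (200, 200)
```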
Different sparse GP methods make different assumptions about how $\mathbf{f}$ relates to $\mathbf{u}$. The two most important are FITC (Fully Independent Training Conditional) and VFE (Variational Free Energy).
FITC (Snelson & Ghahramani, 2006):
FITC assumes the off-diagonal elements of the residual covariance $\boldsymbol{\Sigma}_f$ are zero:
$$p_{\text{FITC}}(\mathbf{f} | \mathbf{u}) = \prod_{i=1}^{n} p(f_i | \mathbf{u}) = \mathcal{N}(\boldsymbol{\mu}_f, \text{diag}(\boldsymbol{\Sigma}_f))$$
This makes training points conditionally independent given $\mathbf{u}$. The resulting approximate likelihood is:
$$p_{\text{FITC}}(\mathbf{y} | \mathbf{u}) = \mathcal{N}(\boldsymbol{\mu}_f, \text{diag}(\boldsymbol{\Sigma}_f) + \sigma^2 \mathbf{I})$$
Pros: fast; captures heteroscedastic residual variance. Cons: the approximation is heuristic, not principled; it can underestimate uncertainty.
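For intuition, the sketch below constructs the FITC marginal likelihood naively, just to expose its structure: the residual covariance $\mathbf{K}_{nn} - \mathbf{Q}_{nn}$ is replaced by its diagonal, so the marginal covariance of $\mathbf{y}$ becomes $\mathbf{Q}_{nn} + \text{diag}(\mathbf{K}_{nn} - \mathbf{Q}_{nn}) + \sigma^2\mathbf{I}$. The function and its arguments are illustrative; practical FITC implementations exploit the low-rank-plus-diagonal structure instead of forming an $n \times n$ matrix.

```python
import numpy as np

def fitc_log_marginal(K_nn, K_nm, K_mm, y, sigma2):
    """Naive O(n^3) construction of the FITC marginal likelihood, for illustration only."""
    n = K_nn.shape[0]
    Q_nn = K_nm @ np.linalg.solve(K_mm, K_nm.T)
    # FITC replaces the residual K_nn - Q_nn by its diagonal:
    cov = Q_nn + np.diag(np.diag(K_nn - Q_nn)) + sigma2 * np.eye(n)
    sign, logdet = np.linalg.slogdet(cov)
    alpha = np.linalg.solve(cov, y)
    return -0.5 * (n * np.log(2 * np.pi) + logdet + y @ alpha)
```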
VFE (Titsias, 2009):
VFE takes a variational inference approach. Instead of modifying the model, it finds the best approximation to the true posterior within a constrained family.
The variational objective (ELBO) is:
$$\mathcal{L} = \log \mathcal{N}(\mathbf{y} \mid \mathbf{0}, \mathbf{Q}_{nn} + \sigma^2 \mathbf{I}) - \frac{1}{2\sigma^2} \text{tr}(\mathbf{K}_{nn} - \mathbf{Q}_{nn})$$
The first term is the marginal likelihood under the Nyström approximation. The second term is a trace penalty that measures how much variance is lost by the approximation.
Key insight: The trace term $\text{tr}(\mathbf{K}_{nn} - \mathbf{Q}_{nn})$ penalizes poor inducing point placement. If the inducing points span the training data well, $\mathbf{Q}_{nn} \approx \mathbf{K}_{nn}$ and the penalty is small. If they miss important regions, the penalty is large.
VFE is a lower bound on the true marginal likelihood. As $m \to n$ with inducing points at training locations, VFE recovers the exact GP. This is a principled approximation that improves systematically as more inducing points are added.
| Aspect | FITC | VFE |
|---|---|---|
| Derivation | Modify model (heuristic) | Variational inference (principled) |
| Objective | Modified marginal likelihood | Evidence Lower Bound (ELBO) |
| Residual variance | Diagonal (heteroscedastic) | Absorbed into trace penalty |
| Uncertainty | Can underestimate | Conservative (proper bound) |
| Recovery at m=n | Not exact | Exact GP recovered |
| Computational complexity | $O(nm^2)$ | $O(nm^2)$ |
The VFE objective in detail:
Maximizing the ELBO:
$$\mathcal{L} = -\frac{n}{2}\log(2\pi) - \frac{1}{2}\log|\mathbf{Q}_{nn} + \sigma^2\mathbf{I}| - \frac{1}{2}\mathbf{y}^T(\mathbf{Q}_{nn} + \sigma^2\mathbf{I})^{-1}\mathbf{y} - \frac{1}{2\sigma^2}\text{tr}(\mathbf{K}_{nn} - \mathbf{Q}_{nn})$$
Using the Woodbury identity, we can evaluate this in $O(nm^2 + m^3)$ instead of $O(n^3)$:
$$|\mathbf{Q}_{nn} + \sigma^2\mathbf{I}| = \sigma^{2n}\, |\mathbf{K}_{mm}|^{-1}\, |\mathbf{K}_{mm} + \sigma^{-2}\mathbf{K}_{mn}\mathbf{K}_{nm}|$$
Similarly, $(\mathbf{Q}_{nn} + \sigma^2\mathbf{I})^{-1}$ can be computed efficiently using matrix inversion lemmas.
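As a concrete illustration, here is a sketch of the full-batch VFE objective that uses exactly these identities without ever forming an $n \times n$ matrix. The function name, the jitter constant, and the convention of passing only the diagonal of $\mathbf{K}_{nn}$ are illustrative choices, not a library interface.

```python
import numpy as np

def vfe_elbo(K_mm, K_nm, K_nn_diag, y, sigma2):
    """Full-batch VFE ELBO in O(n m^2 + m^3); K_nn_diag is the diagonal of K_nn only."""
    n, m = K_nm.shape
    jitter = 1e-6 * np.eye(m)
    L = np.linalg.cholesky(K_mm + jitter)            # K_mm = L L^T, O(m^3)

    # A_w = K_mm + sigma^{-2} K_mn K_nm (the m x m matrix in the identities above)
    A_w = K_mm + jitter + (K_nm.T @ K_nm) / sigma2
    L_A = np.linalg.cholesky(A_w)

    # log|Q_nn + sigma^2 I| = n log sigma^2 - log|K_mm| + log|A_w|
    logdet = (n * np.log(sigma2)
              - 2 * np.sum(np.log(np.diag(L)))
              + 2 * np.sum(np.log(np.diag(L_A))))

    # y^T (Q_nn + sigma^2 I)^{-1} y via the matrix inversion lemma
    Kmn_y = K_nm.T @ y
    alpha = np.linalg.solve(A_w, Kmn_y)
    quad = (y @ y) / sigma2 - (Kmn_y @ alpha) / sigma2**2

    # tr(K_nn - Q_nn) needs only diagonals: diag(Q_nn)_i = k_i^T K_mm^{-1} k_i
    V = np.linalg.solve(L, K_nm.T)                   # m x n
    q_diag = np.sum(V**2, axis=0)
    trace_term = np.sum(K_nn_diag - q_diag)

    return (-0.5 * n * np.log(2 * np.pi)
            - 0.5 * logdet
            - 0.5 * quad
            - 0.5 * trace_term / sigma2)
```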
While VFE reduces complexity from $O(n^3)$ to $O(nm^2)$, we still need to process all $n$ data points. For truly massive datasets (millions of points), even $O(nm^2)$ is prohibitive.
Stochastic Variational Gaussian Processes (SVGP) (Hensman et al., 2013, 2015) enable minibatch training by reformulating the variational objective to decompose over data points.
The key innovation:
SVGP introduces an explicit variational distribution over the inducing variables:
$$q(\mathbf{u}) = \mathcal{N}(\mathbf{m}, \mathbf{S})$$
where $\mathbf{m} \in \mathbb{R}^m$ and $\mathbf{S} \in \mathbb{R}^{m \times m}$ (positive definite) are variational parameters to be optimized.
The ELBO becomes:
$$\mathcal{L} = \sum_{i=1}^{n} \mathbb{E}_{q(f_i)}[\log p(y_i | f_i)] - \text{KL}(q(\mathbf{u}) \,\|\, p(\mathbf{u}))$$
where $q(f_i)$ is the marginal of $f_i$ under $q(\mathbf{u})$.
The likelihood term is a sum over data points! We can approximate it with a minibatch:
$$\sum_{i=1}^{n} \mathbb{E}[\log p(y_i|f_i)] \approx \frac{n}{|B|} \sum_{i \in B} \mathbb{E}[\log p(y_i|f_i)]$$
where $B$ is a random minibatch. This gives an unbiased estimate of the gradient, enabling stochastic optimization.
Computing the ELBO involves three pieces: the KL term, the per-point marginals $q(f_i)$, and the expected log-likelihood.
$$\text{KL}(q(\mathbf{u}) \,\|\, p(\mathbf{u})) = \frac{1}{2}\left[\text{tr}(\mathbf{K}_{mm}^{-1}\mathbf{S}) + \mathbf{m}^T\mathbf{K}_{mm}^{-1}\mathbf{m} - m + \log\frac{|\mathbf{K}_{mm}|}{|\mathbf{S}|}\right]$$
Cost: $O(m^3)$ for Cholesky of $\mathbf{K}_{mm}$ and $\mathbf{S}$.
$$q(f_i) = \mathcal{N}(\mu_i, \sigma_i^2)$$
where
$$\mu_i = \mathbf{k}_i^T \mathbf{K}_{mm}^{-1}\mathbf{m}, \qquad \sigma_i^2 = k(\mathbf{x}_i, \mathbf{x}_i) - \mathbf{k}_i^T\mathbf{K}_{mm}^{-1}\mathbf{k}_i + \mathbf{k}_i^T\mathbf{K}_{mm}^{-1}\mathbf{S}\,\mathbf{K}_{mm}^{-1}\mathbf{k}_i,$$
with $\mathbf{k}_i = [k(\mathbf{z}_1, \mathbf{x}_i), \ldots, k(\mathbf{z}_m, \mathbf{x}_i)]^T$.
Cost per point: $O(m^2)$.
$$\mathbb{E}_{q(f_i)}[\log p(y_i|f_i)] = -\frac{1}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}[(y_i - \mu_i)^2 + \sigma_i^2]$$
Total per-iteration cost: $O(|B| \cdot m^2 + m^3)$, independent of $n$!
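The sketch below assembles one unbiased minibatch ELBO estimate from these pieces for a Gaussian likelihood. Function and argument names are hypothetical; the GPyTorch model further down this page performs the same computation internally.

```python
import numpy as np

def svgp_minibatch_elbo(X_batch, y_batch, n_total, Z, m_var, S, kernel, sigma2):
    """One unbiased minibatch ELBO estimate. Z: (m, d), m_var: (m,), S: (m, m)."""
    B, m = X_batch.shape[0], Z.shape[0]
    jitter = 1e-6 * np.eye(m)
    K_mm = kernel(Z, Z) + jitter
    K_bm = kernel(X_batch, Z)                       # |B| x m

    # Marginals q(f_i) = N(mu_i, s2_i) for each point in the batch
    A = np.linalg.solve(K_mm, K_bm.T)               # columns are K_mm^{-1} k_i
    mu = K_bm @ np.linalg.solve(K_mm, m_var)
    k_ii = np.diag(kernel(X_batch, X_batch))
    s2 = (k_ii
          - np.einsum('bm,mb->b', K_bm, A)
          + np.einsum('mb,mn,nb->b', A, S, A))

    # Expected Gaussian log-likelihood, rescaled so the sum is unbiased for all n
    exp_ll = (-0.5 * np.log(2 * np.pi * sigma2)
              - 0.5 * ((y_batch - mu) ** 2 + s2) / sigma2)
    data_term = (n_total / B) * exp_ll.sum()

    # KL(q(u) || p(u)) between two m-dimensional Gaussians
    L_K = np.linalg.cholesky(K_mm)
    L_S = np.linalg.cholesky(S)
    kl = 0.5 * (np.trace(np.linalg.solve(K_mm, S))
                + m_var @ np.linalg.solve(K_mm, m_var)
                - m
                + 2 * np.sum(np.log(np.diag(L_K)))
                - 2 * np.sum(np.log(np.diag(L_S))))

    return data_term - kl
```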
| Parameter | Size | Role | Optimization |
|---|---|---|---|
| $\mathbf{Z}$ | $m \times d$ | Inducing locations | Gradient descent (continuous) |
| $\mathbf{m}$ | $m \times 1$ | Variational mean | Gradient descent |
| $\mathbf{S}$ | $m \times m$ | Variational covariance | Natural gradients often better |
| Kernel params | Varies | Length scales, variance, etc. | Gradient descent |
| $\sigma^2$ | 1 | Noise variance | Gradient descent |
```python
import torch
import gpytorch
from gpytorch.models import ApproximateGP
from gpytorch.variational import CholeskyVariationalDistribution
from gpytorch.variational import VariationalStrategy


class SVGPModel(ApproximateGP):
    """
    Stochastic Variational Gaussian Process for scalable regression.

    Uses GPyTorch for efficient implementation. This model can handle
    millions of data points with minibatch training.
    """

    def __init__(self, inducing_points):
        # Variational distribution q(u) = N(m, S)
        variational_distribution = CholeskyVariationalDistribution(
            inducing_points.size(0)
        )

        # Variational strategy handles the inducing point framework
        variational_strategy = VariationalStrategy(
            self,
            inducing_points,
            variational_distribution,
            learn_inducing_locations=True  # Optimize Z
        )
        super().__init__(variational_strategy)

        # Standard GP components
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.ScaleKernel(
            gpytorch.kernels.RBFKernel()
        )

    def forward(self, x):
        mean = self.mean_module(x)
        covar = self.covar_module(x)
        return gpytorch.distributions.MultivariateNormal(mean, covar)


def train_svgp(train_x, train_y, n_inducing=500, n_epochs=50, batch_size=1024):
    """
    Train an SVGP model using minibatch stochastic optimization.

    Parameters:
    -----------
    train_x : tensor of shape (n, d)
    train_y : tensor of shape (n,)
    n_inducing : int
    n_epochs : int
    batch_size : int

    Returns:
    --------
    model : trained SVGP model
    likelihood : noise model
    """
    # Initialize inducing points from data
    indices = torch.randperm(train_x.size(0))[:n_inducing]
    inducing_points = train_x[indices].clone()

    # Create model and likelihood
    model = SVGPModel(inducing_points)
    likelihood = gpytorch.likelihoods.GaussianLikelihood()

    # Training mode
    model.train()
    likelihood.train()

    # Optimizer
    optimizer = torch.optim.Adam([
        {'params': model.parameters()},
        {'params': likelihood.parameters()},
    ], lr=0.01)

    # Loss = -ELBO
    mll = gpytorch.mlls.VariationalELBO(
        likelihood, model, num_data=train_y.size(0)
    )

    # Create dataloader for minibatching
    train_dataset = torch.utils.data.TensorDataset(train_x, train_y)
    train_loader = torch.utils.data.DataLoader(
        train_dataset, batch_size=batch_size, shuffle=True
    )

    # Training loop
    for epoch in range(n_epochs):
        epoch_loss = 0.0
        for batch_x, batch_y in train_loader:
            optimizer.zero_grad()
            output = model(batch_x)
            loss = -mll(output, batch_y)
            loss.backward()
            optimizer.step()
            epoch_loss += loss.item()

        if (epoch + 1) % 10 == 0:
            avg_loss = epoch_loss / len(train_loader)
            print(f"Epoch {epoch+1}/{n_epochs}, Avg Loss: {avg_loss:.4f}")

    return model, likelihood


def predict_svgp(model, likelihood, test_x):
    """Make predictions with uncertainty."""
    model.eval()
    likelihood.eval()

    with torch.no_grad(), gpytorch.settings.fast_pred_var():
        observed_pred = likelihood(model(test_x))
        mean = observed_pred.mean
        std = observed_pred.stddev

    return mean, std
```

The quality of sparse GP approximations depends heavily on the inducing point locations $\mathbf{Z}$. Unlike Nyström, where landmarks are selected from the training data, sparse GPs optimize inducing locations as continuous parameters.
Initialization strategies: common choices are a random subset of the training inputs (as in the code above), k-means cluster centers, or a space-filling grid over the input domain.
Optimization dynamics:
Inducing points are optimized jointly with the other parameters via gradient descent. The gradients push inducing points toward regions where the data are dense, where the function varies quickly, and where the approximation currently leaves variance unexplained (the VFE trace penalty makes this explicit).
In practice, inducing points often migrate during training, starting from initial positions and settling into data-adaptive configurations.
Inducing point optimization can be challenging: (1) Local minima—initial placement matters, (2) Redundancy—points may collapse onto each other, (3) Escaping regions—points may leave relevant areas. Good initialization and proper regularization help. Some practitioners fix inducing locations (using k-means) and only optimize variational parameters.
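As one concrete recipe (assuming scikit-learn for k-means, and the `train_x` tensor and `SVGPModel` class from the GPyTorch example above), inducing locations can be initialized at cluster centers and optionally frozen:

```python
import torch
from sklearn.cluster import KMeans

# k-means centroids as inducing locations: a common, robust initialization
n_inducing = 500
kmeans = KMeans(n_clusters=n_inducing, n_init=10).fit(train_x.numpy())
inducing_points = torch.as_tensor(kmeans.cluster_centers_, dtype=train_x.dtype)

model = SVGPModel(inducing_points)

# To keep Z fixed and optimize only the variational/kernel parameters, construct
# the variational strategy inside SVGPModel with learn_inducing_locations=False.
```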
Natural gradients for variational parameters:
The variational parameters $(\mathbf{m}, \mathbf{S})$ live on a curved space (Gaussian distributions form a manifold). Standard gradient descent ignores this geometry, leading to slow convergence.
Natural gradients account for this information geometry by preconditioning the ordinary gradient with the inverse Fisher information of $q(\mathbf{u})$, often converging much faster. For the Gaussian variational family this amounts (up to symmetrization constants) to:

$$\tilde{\nabla}_{\mathbf{m}} \mathcal{L} = \mathbf{S}\, \nabla_{\mathbf{m}} \mathcal{L}, \qquad \tilde{\nabla}_{\mathbf{S}} \mathcal{L} = \mathbf{S}\, \big(\nabla_{\mathbf{S}} \mathcal{L}\big)\, \mathbf{S}$$
GPyTorch and other GP libraries implement efficient natural gradient updates.
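A sketch of how this typically looks in GPyTorch, assuming a model built with `NaturalVariationalDistribution` (the natural-parameter counterpart of `CholeskyVariationalDistribution`) and reusing the `likelihood`, `train_y`, and `train_loader` objects from the earlier example. The exact optimizer API may differ between library versions; treat this as illustrative rather than canonical.

```python
import torch
import gpytorch

# Natural-gradient optimizer for the variational parameters (m, S)
variational_optimizer = gpytorch.optim.NGD(
    model.variational_parameters(), num_data=train_y.size(0), lr=0.1
)
# Ordinary Adam for kernel hyperparameters and the noise variance
hyperparameter_optimizer = torch.optim.Adam(
    [{'params': model.hyperparameters()}, {'params': likelihood.parameters()}], lr=0.01
)
mll = gpytorch.mlls.VariationalELBO(likelihood, model, num_data=train_y.size(0))

for batch_x, batch_y in train_loader:
    variational_optimizer.zero_grad()
    hyperparameter_optimizer.zero_grad()
    loss = -mll(model(batch_x), batch_y)
    loss.backward()
    variational_optimizer.step()      # natural-gradient step on the variational parameters
    hyperparameter_optimizer.step()   # gradient step on kernel and noise parameters
```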
Choosing the number of inducing points:
The number $m$ controls the quality-cost tradeoff: more inducing points give a tighter ELBO and a better approximation, at the price of $O(m^2)$ work per data point and $O(m^3)$ work per iteration.
Practical heuristics:
| Dataset Size | Suggested $m$ | Training Time | Quality |
|---|---|---|---|
| 1,000 | 50-100 | Seconds | Near-exact often achievable |
| 10,000 | 200-500 | Minutes | Excellent with good placement |
| 100,000 | 500-2,000 | Tens of minutes | Good, depends on complexity |
| 1,000,000+ | 1,000-5,000 | Hours | Practical limit for most hardware |
Let's carefully analyze the computational and memory costs of sparse GP methods.
Exact GP baseline: $O(n^3)$ training time (Cholesky of the full kernel matrix) and $O(n^2)$ memory, with $O(n)$ per-test-point mean and $O(n^2)$ per-test-point variance predictions.
VFE (full-batch):
| Operation | Complexity | Notes |
|---|---|---|
| $\mathbf{K}_{mm}$ computation | $O(m^2 d)$ | Once per evaluation |
| $\mathbf{K}_{nm}$ computation | $O(nm d)$ | Dominates for large n |
| $\mathbf{K}_{mm}$ Cholesky | $O(m^3)$ | |
| Woodbury formula | $O(nm^2)$ | For determinant and inverse |
| Trace term | $O(nm)$ | Only diagonal of $(\mathbf{K}_{nn} - \mathbf{Q}_{nn})$ |
| Total per iteration | $O(nm^2 + m^3)$ | |
| Memory | $O(nm)$ | Store $\mathbf{K}_{nm}$ (can batch) |
SVGP (minibatch):
| Operation | Complexity | Notes |
|---|---|---|
| $\mathbf{K}_{mm}$ Cholesky | $O(m^3)$ | Once per iteration |
| KL divergence | $O(m^3)$ | Cholesky of $\mathbf{S}$ |
| $\mathbf{K}_{Bm}$ (batch to inducing) | $O(|B|m d)$ | Per minibatch |
| Marginal $q(f_i)$ for batch | $O(|B|m^2)$ | Matrix-vector products |
| Expected log-likelihood | $O(|B|)$ | Scalar per point |
| Total per iteration | $O(|B|m^2 + m^3)$ | Independent of n! |
| Memory | $O(|B|m + m^2)$ | Truly scalable |
SVGP per-iteration cost is $O(|B|m^2 + m^3)$—completely independent of dataset size $n$! With $|B| = 1000$ and $m = 500$, each iteration takes the same time whether you have 10,000 or 10,000,000 training points. This enables GPs at unprecedented scale.
Prediction complexity:
Given a trained SVGP with variational parameters $(\mathbf{m}, \mathbf{S})$:
Mean prediction: $\mu_* = \mathbf{k}_*^T \mathbf{K}_{mm}^{-1} \mathbf{m}$ — $O(m)$ per test point once $\mathbf{K}_{mm}^{-1}\mathbf{m}$ is precomputed

Variance prediction: $\sigma_*^2 = k_{**} - \mathbf{k}_*^T \mathbf{K}_{mm}^{-1} \mathbf{k}_* + \mathbf{k}_*^T \mathbf{K}_{mm}^{-1} \mathbf{S}\, \mathbf{K}_{mm}^{-1} \mathbf{k}_*$ — $O(m^2)$ per test point

Compare to the exact GP: $O(n)$ for the mean, $O(n^2)$ for the variance. With $m \ll n$, predictions are dramatically faster.
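These two formulas translate directly into code. The sketch below (hypothetical names, same `kernel` convention as the earlier NumPy snippets) computes the latent predictive mean and variance from the learned $(\mathbf{Z}, \mathbf{m}, \mathbf{S})$.

```python
import numpy as np

def svgp_predict(X_test, Z, m_var, S, kernel, jitter=1e-6):
    """Latent predictive mean and variance at X_test from the learned (Z, m, S)."""
    m = Z.shape[0]
    K_mm = kernel(Z, Z) + jitter * np.eye(m)
    K_tm = kernel(X_test, Z)                        # n_test x m
    A = np.linalg.solve(K_mm, K_tm.T)               # columns are K_mm^{-1} k_*

    mean = K_tm @ np.linalg.solve(K_mm, m_var)      # k_*^T K_mm^{-1} m
    k_ss = np.diag(kernel(X_test, X_test))
    var = (k_ss
           - np.einsum('tm,mt->t', K_tm, A)         # - k_*^T K_mm^{-1} k_*
           + np.einsum('mt,mn,nt->t', A, S, A))     # + k_*^T K_mm^{-1} S K_mm^{-1} k_*
    return mean, var
```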
Sparse GPs connect deeply with other kernel approximation methods we've studied.
Sparse GP vs. Nyström:
Nyström approximates the kernel matrix: $\tilde{\mathbf{K}} = \mathbf{K}_{nm}\mathbf{K}_{mm}^{-1}\mathbf{K}_{mn}$.
VFE uses the same low-rank structure but adds a trace penalty that accounts for the variance discarded by the approximation, and it treats the inducing locations $\mathbf{Z}$ as free parameters optimized through the ELBO.
In fact, if we use fixed inducing points at training locations and ignore the trace term, VFE reduces to Nyström-based regression!
Sparse GP vs. RFF:
Both provide $m$-dimensional features, but with key differences:
| Aspect | Sparse GP (SVGP) | Nyström | RFF |
|---|---|---|---|
| Basis type | Inducing points (data-like) | Data points | Random Fourier features |
| Basis optimization | Yes (gradient-based) | Selection only | No (random) |
| Uncertainty | Full posterior variance | No uncertainty | No uncertainty |
| Probabilistic model | Complete Bayesian | Point estimate | Point estimate |
| Minibatch training | Yes (SVGP) | Possible but not standard | Yes (standard linear) |
| Kernel restriction | Any kernel | Any kernel | Shift-invariant only |
The unified view:
All three methods can be understood through the lens of feature-space approximation:
RFF: $\phi(\mathbf{x}) = \sqrt{\frac{2}{D}}\,[\cos(\boldsymbol{\omega}_j^T\mathbf{x} + b_j)]_{j=1}^D$ — random trigonometric features
Nyström: $\phi(\mathbf{x}) = \mathbf{K}_{mm}^{-1/2}\, [k(\mathbf{z}_j, \mathbf{x})]_{j=1}^m$ — kernel-based features centered at landmarks
Sparse GP: Same features as Nyström, but with learned $\mathbf{Z}$ and full posterior inference over weights
Sparse GPs add the Bayesian layer: instead of point-estimating the weights, they maintain distributions, enabling uncertainty quantification.
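The sketch below makes the unified view concrete by computing both feature maps for an RBF kernel (illustrative helper functions, not a library API). A sparse GP uses the Nyström-style features but additionally maintains a posterior over the weights.

```python
import numpy as np

def nystrom_features(X, Z, kernel, jitter=1e-6):
    """phi(x) = K_mm^{-1/2} k(Z, x); inner products reproduce Q(x, x')."""
    m = Z.shape[0]
    K_mm = kernel(Z, Z) + jitter * np.eye(m)
    w, V = np.linalg.eigh(K_mm)                     # symmetric inverse square root
    K_mm_inv_sqrt = (V / np.sqrt(w)) @ V.T
    return kernel(X, Z) @ K_mm_inv_sqrt             # n x m features

def rff_features(X, lengthscale=1.0, D=100, seed=0):
    """Random Fourier features for the RBF kernel (shift-invariant kernels only)."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(scale=1.0 / lengthscale, size=(d, D))   # omega_j ~ spectral density
    b = rng.uniform(0.0, 2.0 * np.pi, size=D)
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)             # n x D features
```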
Use SVGP when: You need uncertainty estimates, Bayesian model selection is important, or you're doing Bayesian optimization/active learning. Use Nyström when: You just need predictions, computational budget is limited, or interpretability of landmarks matters. Use RFF when: The kernel is shift-invariant, simplicity is key, or you need the fastest possible training.
Successfully deploying sparse GPs requires attention to several practical details.
1. Initialization matters:
Poor initialization can lead to slow convergence, inducing points stranded in low-density regions, or convergence to a poor local optimum of the ELBO.

Best practices: initialize $\mathbf{Z}$ with k-means centroids or a random subset of the training inputs, standardize the inputs so that length scales start in a sensible range, and monitor the ELBO; if inducing points drift into irrelevant regions, consider fixing them and optimizing only the variational and kernel parameters.
2. Hyperparameter optimization:
Sparse GPs still have kernel hyperparameters (length scales, variance, noise). Options: optimize them jointly with the variational parameters and inducing locations by maximizing the ELBO (the standard approach), or fix them first—for example from an exact GP fit to a subset of the data—and then learn only the variational parameters.
3. Beyond regression:
SVGP extends to classification and multi-output settings: non-Gaussian likelihoods (e.g., Bernoulli for binary classification) are handled by evaluating the expected log-likelihood term with quadrature or Monte Carlo sampling, and multi-output models reuse the same inducing point machinery for each latent function (a classification sketch follows below).
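A minimal sketch of the classification case in GPyTorch, assuming the `model`, `train_y`, and `test_x` objects from the earlier example; only the likelihood changes.

```python
import torch
import gpytorch

# Binary labels in {0, 1}: swap the Gaussian likelihood for a Bernoulli one.
likelihood = gpytorch.likelihoods.BernoulliLikelihood()
mll = gpytorch.mlls.VariationalELBO(likelihood, model, num_data=train_y.size(0))
# The training loop is unchanged; the expected log-likelihood no longer has a
# closed form, so the library evaluates it numerically.

# Prediction returns class probabilities rather than a Gaussian predictive.
model.eval(); likelihood.eval()
with torch.no_grad():
    probs = likelihood(model(test_x)).mean   # P(y = 1 | x) for each test point
```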
4. Software:
Mature SVGP implementations exist in GPyTorch (PyTorch, used in the example above), GPflow (TensorFlow), and GPy; the first two offer GPU acceleration out of the box.
Modern SVGP implementations with GPU acceleration can train on millions of points in minutes to hours. They are the method of choice for scalable uncertainty-aware regression and classification, used in production systems for drug discovery, climate modeling, and autonomous vehicles.
Sparse Gaussian Processes represent the most principled approach to scaling GPs while retaining their full Bayesian character—uncertainty quantification, principled model comparison, and coherent probabilistic predictions.
What's next:
We've now covered the three pillars of scalable kernel methods: Random Fourier Features, Nyström approximation, and Sparse GPs. In the final page of this module, we synthesize these techniques into a comprehensive scalability strategies guide—when to use which method, how to combine them, and how to approach real-world large-scale kernel learning problems.
You now understand Sparse Gaussian Processes from the inducing point framework through SVGP's minibatch training. This knowledge equips you to deploy uncertainty-aware machine learning at scales that were previously impossible with full GPs.