Gaussian Processes (GPs) are among the most powerful tools in probabilistic machine learning. They provide not just predictions, but complete predictive distributions—means and variances that quantify epistemic uncertainty. This makes GPs invaluable for applications where knowing "how sure are we?" is as important as the prediction itself: active learning, Bayesian optimization, safety-critical systems, and more.
But GPs have a problem: they don't scale. The exact GP posterior requires $O(n^3)$ computation and $O(n^2)$ memory—precisely the limitations we've been discussing throughout this module.
Sparse Gaussian Processes solve this by introducing inducing points—a small set of $m$ pseudo-observations that summarize the entire dataset. Unlike Nyström (which also uses representative points), sparse GPs work within a principled Bayesian framework, optimizing the inducing point locations and values through variational inference.
The result is a GP that scales to millions of points while retaining full uncertainty quantification.
By the end of this page, you will understand the inducing point framework and why it enables scalability, the key sparse GP approximations (FITC, VFE, SVGP), how variational inference enables minibatch training, inducing point optimization strategies, and the relationship between sparse GPs and other kernel approximations.
The core idea of sparse GPs is to introduce inducing variables—function values at a set of $m$ inducing locations $\mathbf{Z} = \{\mathbf{z}_1, \ldots, \mathbf{z}_m\}$ (also called inducing inputs or pseudo-inputs).
The setup:
Let $\mathbf{f} = (f(\mathbf{x}_1), \ldots, f(\mathbf{x}_n))^T$ be the function values at training inputs, and $\mathbf{u} = (f(\mathbf{z}_1), \ldots, f(\mathbf{z}_m))^T$ be the function values at inducing locations.
Under the GP prior:
$$\begin{bmatrix} \mathbf{f} \\ \mathbf{u} \end{bmatrix} \sim \mathcal{N}\left( \mathbf{0}, \begin{bmatrix} \mathbf{K}_{nn} & \mathbf{K}_{nm} \\ \mathbf{K}_{mn} & \mathbf{K}_{mm} \end{bmatrix} \right)$$
where $\mathbf{K}_{nn}$, $\mathbf{K}_{nm} = \mathbf{K}_{mn}^T$, and $\mathbf{K}_{mm}$ are the kernel matrices among training points, between training and inducing points, and among inducing points, respectively.

If we can express the training function values $\mathbf{f}$ in terms of the inducing values $\mathbf{u}$, then inference involves only the $m \times m$ matrix $\mathbf{K}_{mm}$ instead of the $n \times n$ matrix $\mathbf{K}_{nn}$. The challenge is doing this approximation in a principled way that preserves GP properties.
The conditional distribution:
From standard Gaussian conditioning, the distribution of $\mathbf{f}$ given $\mathbf{u}$ is:
$$p(\mathbf{f} | \mathbf{u}) = \mathcal{N}\big(\mathbf{K}_{nm}\mathbf{K}_{mm}^{-1}\mathbf{u},\; \mathbf{K}_{nn} - \mathbf{K}_{nm}\mathbf{K}_{mm}^{-1}\mathbf{K}_{mn}\big)$$
Define the conditional mean $\boldsymbol{\mu}_f = \mathbf{K}_{nm}\mathbf{K}_{mm}^{-1}\mathbf{u}$ and the residual covariance $\boldsymbol{\Sigma}_f = \mathbf{K}_{nn} - \mathbf{K}_{nm}\mathbf{K}_{mm}^{-1}\mathbf{K}_{mn} = \mathbf{K}_{nn} - \mathbf{Q}_{nn}$; both appear repeatedly in the approximations below.
The graphical model:
Sparse GPs treat $\mathbf{u}$ as a compact summary of the function: the joint prior factorizes as

$$p(\mathbf{f}, \mathbf{u}) = p(\mathbf{f} | \mathbf{u})\, p(\mathbf{u})$$

and the approximations below constrain or replace $p(\mathbf{f} | \mathbf{u})$ so that all information about the training data flows through the $m$ inducing variables. The notation used throughout:
| Symbol | Size | Meaning |
|---|---|---|
| $\mathbf{Z}$ | $m \times d$ | Inducing input locations (learnable) |
| $\mathbf{u}$ | $m \times 1$ | Function values at inducing locations |
| $\mathbf{K}_{mm}$ | $m \times m$ | Kernel matrix among inducing points |
| $\mathbf{K}_{nm}$ | $n \times m$ | Kernel between training and inducing |
| $\mathbf{Q}_{nn}$ | $n \times n$ | Low-rank approximation: $\mathbf{K}_{nm}\mathbf{K}_{mm}^{-1}\mathbf{K}_{mn}$ |
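To ground the notation, here is a minimal NumPy sketch (toy data and a hypothetical `rbf_kernel` helper, not a library API) that builds these matrices together with the conditional mean and residual covariance defined above.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0, variance=1.0):
    """RBF kernel matrix between rows of A and rows of B."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * sq_dists / lengthscale**2)

rng = np.random.default_rng(0)
n, m, d = 200, 15, 2
X = rng.normal(size=(n, d))                 # training inputs
Z = X[rng.choice(n, m, replace=False)]      # inducing inputs (here: a random subset)
u = rng.normal(size=m)                      # some inducing function values

K_nn = rbf_kernel(X, X)
K_nm = rbf_kernel(X, Z)                             # n x m cross-covariance
K_mm = rbf_kernel(Z, Z) + 1e-6 * np.eye(m)          # jitter for numerical stability

# Low-rank approximation Q_nn = K_nm K_mm^{-1} K_mn
K_mm_inv_K_mn = np.linalg.solve(K_mm, K_nm.T)
Q_nn = K_nm @ K_mm_inv_K_mn

# Conditional p(f | u): mean and residual covariance
mu_f = K_nm @ np.linalg.solve(K_mm, u)
Sigma_f = K_nn - Q_nn                       # what the sparse approximations discard
print(mu_f.shape, Sigma_f.shape)            # (200,), (200, 200)
```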
Different sparse GP methods make different assumptions about how $\mathbf{f}$ relates to $\mathbf{u}$. The two most important are FITC (Fully Independent Training Conditional) and VFE (Variational Free Energy).
FITC (Snelson & Ghahramani, 2006):
FITC assumes the off-diagonal elements of the residual covariance $\boldsymbol{\Sigma}_f$ are zero:
$$p_{\text{FITC}}(\mathbf{f} | \mathbf{u}) = \prod_{i=1}^{n} p(f_i | \mathbf{u}) = \mathcal{N}(\boldsymbol{\mu}_f, \text{diag}(\boldsymbol{\Sigma}_f))$$
This makes training points conditionally independent given $\mathbf{u}$. The resulting approximate likelihood is:
$$p_{\text{FITC}}(\mathbf{y} | \mathbf{u}) = \mathcal{N}(\boldsymbol{\mu}_f, \text{diag}(\boldsymbol{\Sigma}_f) + \sigma^2 \mathbf{I})$$
Pros: fast; captures heteroscedastic residual variance. Cons: the approximation is heuristic, not principled; it can underestimate uncertainty.
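For intuition, the sketch below constructs the FITC marginal likelihood naively, just to expose its structure: the residual covariance $\mathbf{K}_{nn} - \mathbf{Q}_{nn}$ is replaced by its diagonal, so the marginal covariance of $\mathbf{y}$ becomes $\mathbf{Q}_{nn} + \text{diag}(\mathbf{K}_{nn} - \mathbf{Q}_{nn}) + \sigma^2\mathbf{I}$. The function and its arguments are illustrative; practical FITC implementations exploit the low-rank-plus-diagonal structure instead of forming an $n \times n$ matrix.

```python
import numpy as np

def fitc_log_marginal(K_nn, K_nm, K_mm, y, sigma2):
    """Naive O(n^3) construction of the FITC marginal likelihood, for illustration only."""
    n = K_nn.shape[0]
    Q_nn = K_nm @ np.linalg.solve(K_mm, K_nm.T)
    # FITC replaces the residual K_nn - Q_nn by its diagonal:
    cov = Q_nn + np.diag(np.diag(K_nn - Q_nn)) + sigma2 * np.eye(n)
    sign, logdet = np.linalg.slogdet(cov)
    alpha = np.linalg.solve(cov, y)
    return -0.5 * (n * np.log(2 * np.pi) + logdet + y @ alpha)
```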
VFE (Titsias, 2009):
VFE takes a variational inference approach. Instead of modifying the model, it finds the best approximation to the true posterior within a constrained family.
The variational objective (ELBO) is:
$$\mathcal{L} = \log \mathcal{N}(\mathbf{y} \mid \mathbf{0}, \mathbf{Q}_{nn} + \sigma^2 \mathbf{I}) - \frac{1}{2\sigma^2} \text{tr}(\mathbf{K}_{nn} - \mathbf{Q}_{nn})$$
The first term is the marginal likelihood under the Nyström approximation. The second term is a trace penalty that measures how much variance is lost by the approximation.
Key insight: The trace term $\text{tr}(\mathbf{K}_{nn} - \mathbf{Q}_{nn})$ penalizes poor inducing point placement. If the inducing points span the training data well, $\mathbf{Q}_{nn} \approx \mathbf{K}_{nn}$ and the penalty is small. If they miss important regions, the penalty is large.
VFE is a lower bound on the true marginal likelihood. As $m \to n$ with inducing points at training locations, VFE recovers the exact GP. This is a principled approximation that improves systematically as more inducing points are added.
| Aspect | FITC | VFE |
|---|---|---|
| Derivation | Modify model (heuristic) | Variational inference (principled) |
| Objective | Modified marginal likelihood | Evidence Lower Bound (ELBO) |
| Residual variance | Diagonal (heteroscedastic) | Absorbed into trace penalty |
| Uncertainty | Can underestimate | Conservative (proper bound) |
| Recovery at m=n | Not exact | Exact GP recovered |
| Computational complexity | $O(nm^2)$ | $O(nm^2)$ |
The VFE objective in detail:
Maximizing the ELBO:
$$\mathcal{L} = -\frac{n}{2}\log(2\pi) - \frac{1}{2}\log|\mathbf{Q}_{nn} + \sigma^2\mathbf{I}| - \frac{1}{2}\mathbf{y}^T(\mathbf{Q}_{nn} + \sigma^2\mathbf{I})^{-1}\mathbf{y} - \frac{1}{2\sigma^2}\text{tr}(\mathbf{K}_{nn} - \mathbf{Q}_{nn})$$
Using the Woodbury identity, we can evaluate this in $O(nm^2 + m^3)$ instead of $O(n^3)$:
$$|\mathbf{Q}_{nn} + \sigma^2\mathbf{I}| = \sigma^{2n}\, |\mathbf{K}_{mm}|^{-1}\, |\mathbf{K}_{mm} + \sigma^{-2}\mathbf{K}_{mn}\mathbf{K}_{nm}|$$
Similarly, $(\mathbf{Q}_{nn} + \sigma^2\mathbf{I})^{-1}$ can be computed efficiently using matrix inversion lemmas.
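As a concrete illustration, here is a sketch of the full-batch VFE objective that uses exactly these identities without ever forming an $n \times n$ matrix. The function name, the jitter constant, and the convention of passing only the diagonal of $\mathbf{K}_{nn}$ are illustrative choices, not a library interface.

```python
import numpy as np

def vfe_elbo(K_mm, K_nm, K_nn_diag, y, sigma2):
    """Full-batch VFE ELBO in O(n m^2 + m^3); K_nn_diag is the diagonal of K_nn only."""
    n, m = K_nm.shape
    jitter = 1e-6 * np.eye(m)
    L = np.linalg.cholesky(K_mm + jitter)            # K_mm = L L^T, O(m^3)

    # A_w = K_mm + sigma^{-2} K_mn K_nm (the m x m matrix in the identities above)
    A_w = K_mm + jitter + (K_nm.T @ K_nm) / sigma2
    L_A = np.linalg.cholesky(A_w)

    # log|Q_nn + sigma^2 I| = n log sigma^2 - log|K_mm| + log|A_w|
    logdet = (n * np.log(sigma2)
              - 2 * np.sum(np.log(np.diag(L)))
              + 2 * np.sum(np.log(np.diag(L_A))))

    # y^T (Q_nn + sigma^2 I)^{-1} y via the matrix inversion lemma
    Kmn_y = K_nm.T @ y
    alpha = np.linalg.solve(A_w, Kmn_y)
    quad = (y @ y) / sigma2 - (Kmn_y @ alpha) / sigma2**2

    # tr(K_nn - Q_nn) needs only diagonals: diag(Q_nn)_i = k_i^T K_mm^{-1} k_i
    V = np.linalg.solve(L, K_nm.T)                   # m x n
    q_diag = np.sum(V**2, axis=0)
    trace_term = np.sum(K_nn_diag - q_diag)

    return (-0.5 * n * np.log(2 * np.pi)
            - 0.5 * logdet
            - 0.5 * quad
            - 0.5 * trace_term / sigma2)
```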
While VFE reduces complexity from $O(n^3)$ to $O(nm^2)$, we still need to process all $n$ data points. For truly massive datasets (millions of points), even $O(nm^2)$ is prohibitive.
Stochastic Variational Gaussian Processes (SVGP) (Hensman et al., 2013, 2015) enable minibatch training by reformulating the variational objective to decompose over data points.
The key innovation:
SVGP introduces an explicit variational distribution over the inducing variables:
$$q(\mathbf{u}) = \mathcal{N}(\mathbf{m}, \mathbf{S})$$
where $\mathbf{m} \in \mathbb{R}^m$ and $\mathbf{S} \in \mathbb{R}^{m \times m}$ (positive definite) are variational parameters to be optimized.
The ELBO becomes:
$$\mathcal{L} = \sum_{i=1}^{n} \mathbb{E}_{q(f_i)}[\log p(y_i | f_i)] - \text{KL}(q(\mathbf{u}) \,\|\, p(\mathbf{u}))$$
where $q(f_i)$ is the marginal of $f_i$ under $q(\mathbf{u})$.
The likelihood term is a sum over data points! We can approximate it with a minibatch:
$$\sum_{i=1}^{n} \mathbb{E}[\log p(y_i|f_i)] \approx \frac{n}{|B|} \sum_{i \in B} \mathbb{E}[\log p(y_i|f_i)]$$
where $B$ is a random minibatch. This gives an unbiased estimate of the gradient, enabling stochastic optimization.
Computing the ELBO involves three pieces: the KL term, the per-point marginals $q(f_i)$, and the expected log-likelihood.
$$\text{KL}(q(\mathbf{u}) \,\|\, p(\mathbf{u})) = \frac{1}{2}\left[\text{tr}(\mathbf{K}_{mm}^{-1}\mathbf{S}) + \mathbf{m}^T\mathbf{K}_{mm}^{-1}\mathbf{m} - m + \log\frac{|\mathbf{K}_{mm}|}{|\mathbf{S}|}\right]$$
Cost: $O(m^3)$ for Cholesky of $\mathbf{K}_{mm}$ and $\mathbf{S}$.
$$q(f_i) = \mathcal{N}(\mu_i, \sigma_i^2)$$
where
$$\mu_i = \mathbf{k}_i^T \mathbf{K}_{mm}^{-1}\mathbf{m}, \qquad \sigma_i^2 = k(\mathbf{x}_i, \mathbf{x}_i) - \mathbf{k}_i^T\mathbf{K}_{mm}^{-1}\mathbf{k}_i + \mathbf{k}_i^T\mathbf{K}_{mm}^{-1}\mathbf{S}\,\mathbf{K}_{mm}^{-1}\mathbf{k}_i,$$
with $\mathbf{k}_i = [k(\mathbf{z}_1, \mathbf{x}_i), \ldots, k(\mathbf{z}_m, \mathbf{x}_i)]^T$.
Cost per point: $O(m^2)$.
$$\mathbb{E}_{q(f_i)}[\log p(y_i|f_i)] = -\frac{1}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}[(y_i - \mu_i)^2 + \sigma_i^2]$$
Total per-iteration cost: $O(|B| \cdot m^2 + m^3)$, independent of $n$!
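The sketch below assembles one unbiased minibatch ELBO estimate from these pieces for a Gaussian likelihood. Function and argument names are hypothetical; the GPyTorch model further down this page performs the same computation internally.

```python
import numpy as np

def svgp_minibatch_elbo(X_batch, y_batch, n_total, Z, m_var, S, kernel, sigma2):
    """One unbiased minibatch ELBO estimate. Z: (m, d), m_var: (m,), S: (m, m)."""
    B, m = X_batch.shape[0], Z.shape[0]
    jitter = 1e-6 * np.eye(m)
    K_mm = kernel(Z, Z) + jitter
    K_bm = kernel(X_batch, Z)                       # |B| x m

    # Marginals q(f_i) = N(mu_i, s2_i) for each point in the batch
    A = np.linalg.solve(K_mm, K_bm.T)               # columns are K_mm^{-1} k_i
    mu = K_bm @ np.linalg.solve(K_mm, m_var)
    k_ii = np.diag(kernel(X_batch, X_batch))
    s2 = (k_ii
          - np.einsum('bm,mb->b', K_bm, A)
          + np.einsum('mb,mn,nb->b', A, S, A))

    # Expected Gaussian log-likelihood, rescaled so the sum is unbiased for all n
    exp_ll = (-0.5 * np.log(2 * np.pi * sigma2)
              - 0.5 * ((y_batch - mu) ** 2 + s2) / sigma2)
    data_term = (n_total / B) * exp_ll.sum()

    # KL(q(u) || p(u)) between two m-dimensional Gaussians
    L_K = np.linalg.cholesky(K_mm)
    L_S = np.linalg.cholesky(S)
    kl = 0.5 * (np.trace(np.linalg.solve(K_mm, S))
                + m_var @ np.linalg.solve(K_mm, m_var)
                - m
                + 2 * np.sum(np.log(np.diag(L_K)))
                - 2 * np.sum(np.log(np.diag(L_S))))

    return data_term - kl
```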
| Parameter | Size | Role | Optimization |
|---|---|---|---|
| $\mathbf{Z}$ | $m \times d$ | Inducing locations | Gradient descent (continuous) |
| $\mathbf{m}$ | $m \times 1$ | Variational mean | Gradient descent |
| $\mathbf{S}$ | $m \times m$ | Variational covariance | Natural gradients often better |
| Kernel params | Varies | Length scales, variance, etc. | Gradient descent |
| $\sigma^2$ | 1 | Noise variance | Gradient descent |
```python
import torch
import gpytorch
from gpytorch.models import ApproximateGP
from gpytorch.variational import CholeskyVariationalDistribution
from gpytorch.variational import VariationalStrategy


class SVGPModel(ApproximateGP):
    """
    Stochastic Variational Gaussian Process for scalable regression.

    Uses GPyTorch for efficient implementation. This model can handle
    millions of data points with minibatch training.
    """

    def __init__(self, inducing_points):
        # Variational distribution q(u) = N(m, S)
        variational_distribution = CholeskyVariationalDistribution(
            inducing_points.size(0)
        )

        # Variational strategy handles the inducing point framework
        variational_strategy = VariationalStrategy(
            self,
            inducing_points,
            variational_distribution,
            learn_inducing_locations=True  # Optimize Z
        )
        super().__init__(variational_strategy)

        # Standard GP components
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.ScaleKernel(
            gpytorch.kernels.RBFKernel()
        )

    def forward(self, x):
        mean = self.mean_module(x)
        covar = self.covar_module(x)
        return gpytorch.distributions.MultivariateNormal(mean, covar)


def train_svgp(train_x, train_y, n_inducing=500, n_epochs=50, batch_size=1024):
    """
    Train an SVGP model using minibatch stochastic optimization.

    Parameters:
    -----------
    train_x : tensor of shape (n, d)
    train_y : tensor of shape (n,)
    n_inducing : int
    n_epochs : int
    batch_size : int

    Returns:
    --------
    model : trained SVGP model
    likelihood : noise model
    """
    # Initialize inducing points from data
    indices = torch.randperm(train_x.size(0))[:n_inducing]
    inducing_points = train_x[indices].clone()

    # Create model and likelihood
    model = SVGPModel(inducing_points)
    likelihood = gpytorch.likelihoods.GaussianLikelihood()

    # Training mode
    model.train()
    likelihood.train()

    # Optimizer
    optimizer = torch.optim.Adam([
        {'params': model.parameters()},
        {'params': likelihood.parameters()},
    ], lr=0.01)

    # Loss = -ELBO
    mll = gpytorch.mlls.VariationalELBO(
        likelihood, model, num_data=train_y.size(0)
    )

    # Create dataloader for minibatching
    train_dataset = torch.utils.data.TensorDataset(train_x, train_y)
    train_loader = torch.utils.data.DataLoader(
        train_dataset, batch_size=batch_size, shuffle=True
    )

    # Training loop
    for epoch in range(n_epochs):
        epoch_loss = 0.0
        for batch_x, batch_y in train_loader:
            optimizer.zero_grad()
            output = model(batch_x)
            loss = -mll(output, batch_y)
            loss.backward()
            optimizer.step()
            epoch_loss += loss.item()

        if (epoch + 1) % 10 == 0:
            avg_loss = epoch_loss / len(train_loader)
            print(f"Epoch {epoch+1}/{n_epochs}, Avg Loss: {avg_loss:.4f}")

    return model, likelihood


def predict_svgp(model, likelihood, test_x):
    """Make predictions with uncertainty."""
    model.eval()
    likelihood.eval()

    with torch.no_grad(), gpytorch.settings.fast_pred_var():
        observed_pred = likelihood(model(test_x))
        mean = observed_pred.mean
        std = observed_pred.stddev

    return mean, std
```

The quality of sparse GP approximations depends heavily on the inducing point locations $\mathbf{Z}$. Unlike Nyström, where landmarks are selected from the training data, sparse GPs optimize inducing locations as continuous parameters.
Initialization strategies: common choices are a random subset of the training inputs (as in the code above), k-means cluster centers, or a space-filling grid over the input domain.
Optimization dynamics:
Inducing points are optimized jointly with the other parameters via gradient descent. The gradients push inducing points toward regions where the data are dense, where the function varies quickly, and where the approximation currently leaves variance unexplained (the VFE trace penalty makes this explicit).
In practice, inducing points often migrate during training, starting from initial positions and settling into data-adaptive configurations.
Inducing point optimization can be challenging: (1) Local minima—initial placement matters, (2) Redundancy—points may collapse onto each other, (3) Escaping regions—points may leave relevant areas. Good initialization and proper regularization help. Some practitioners fix inducing locations (using k-means) and only optimize variational parameters.
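As one concrete recipe (assuming scikit-learn for k-means, and the `train_x` tensor and `SVGPModel` class from the GPyTorch example above), inducing locations can be initialized at cluster centers and optionally frozen:

```python
import torch
from sklearn.cluster import KMeans

# k-means centroids as inducing locations: a common, robust initialization
n_inducing = 500
kmeans = KMeans(n_clusters=n_inducing, n_init=10).fit(train_x.numpy())
inducing_points = torch.as_tensor(kmeans.cluster_centers_, dtype=train_x.dtype)

model = SVGPModel(inducing_points)

# To keep Z fixed and optimize only the variational/kernel parameters, construct
# the variational strategy inside SVGPModel with learn_inducing_locations=False.
```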
Natural gradients for variational parameters:
The variational parameters $(\mathbf{m}, \mathbf{S})$ live on a curved space (Gaussian distributions form a manifold). Standard gradient descent ignores this geometry, leading to slow convergence.
Natural gradients account for this information geometry by preconditioning the ordinary gradient with the inverse Fisher information of $q(\mathbf{u})$, often converging much faster. For the Gaussian variational family this amounts (up to symmetrization constants) to:

$$\tilde{\nabla}_{\mathbf{m}} \mathcal{L} = \mathbf{S}\, \nabla_{\mathbf{m}} \mathcal{L}, \qquad \tilde{\nabla}_{\mathbf{S}} \mathcal{L} = \mathbf{S}\, \big(\nabla_{\mathbf{S}} \mathcal{L}\big)\, \mathbf{S}$$
GPyTorch and other GP libraries implement efficient natural gradient updates.
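A sketch of how this typically looks in GPyTorch, assuming a model built with `NaturalVariationalDistribution` (the natural-parameter counterpart of `CholeskyVariationalDistribution`) and reusing the `likelihood`, `train_y`, and `train_loader` objects from the earlier example. The exact optimizer API may differ between library versions; treat this as illustrative rather than canonical.

```python
import torch
import gpytorch

# Natural-gradient optimizer for the variational parameters (m, S)
variational_optimizer = gpytorch.optim.NGD(
    model.variational_parameters(), num_data=train_y.size(0), lr=0.1
)
# Ordinary Adam for kernel hyperparameters and the noise variance
hyperparameter_optimizer = torch.optim.Adam(
    [{'params': model.hyperparameters()}, {'params': likelihood.parameters()}], lr=0.01
)
mll = gpytorch.mlls.VariationalELBO(likelihood, model, num_data=train_y.size(0))

for batch_x, batch_y in train_loader:
    variational_optimizer.zero_grad()
    hyperparameter_optimizer.zero_grad()
    loss = -mll(model(batch_x), batch_y)
    loss.backward()
    variational_optimizer.step()      # natural-gradient step on the variational parameters
    hyperparameter_optimizer.step()   # gradient step on kernel and noise parameters
```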
Choosing the number of inducing points:
The number $m$ controls the quality-cost tradeoff: more inducing points give a tighter ELBO and a better approximation, at the price of $O(m^2)$ work per data point and $O(m^3)$ work per iteration.
Practical heuristics:
| Dataset Size | Suggested $m$ | Training Time | Quality |
|---|---|---|---|
| 1,000 | 50-100 | Seconds | Near-exact often achievable |
| 10,000 | 200-500 | Minutes | Excellent with good placement |
| 100,000 | 500-2,000 | Tens of minutes | Good, depends on complexity |
| 1,000,000+ | 1,000-5,000 | Hours | Practical limit for most hardware |
Let's carefully analyze the computational and memory costs of sparse GP methods.
Exact GP baseline: $O(n^3)$ training time (Cholesky of the full kernel matrix) and $O(n^2)$ memory, with $O(n)$ per-test-point mean and $O(n^2)$ per-test-point variance predictions.
VFE (full-batch):
| Operation | Complexity | Notes |
|---|---|---|
| $\mathbf{K}_{mm}$ computation | $O(m^2 d)$ | Once per evaluation |
| $\mathbf{K}_{nm}$ computation | $O(nm d)$ | Dominates for large n |
| $\mathbf{K}_{mm}$ Cholesky | $O(m^3)$ | |
| Woodbury formula | $O(nm^2)$ | For determinant and inverse |
| Trace term | $O(nm)$ | Only diagonal of $(\mathbf{K}_{nn} - \mathbf{Q}_{nn})$ |
| Total per iteration | $O(nm^2 + m^3)$ | |
| Memory | $O(nm)$ | Store $\mathbf{K}_{nm}$ (can batch) |
SVGP (minibatch):
| Operation | Complexity | Notes |
|---|---|---|
| $\mathbf{K}_{mm}$ Cholesky | $O(m^3)$ | Once per iteration |
| KL divergence | $O(m^3)$ | Cholesky of $\mathbf{S}$ |
| $\mathbf{K}_{Bm}$ (batch to inducing) | $O(|B|m d)$ | Per minibatch |
| Marginal $q(f_i)$ for batch | $O(|B|m^2)$ | Matrix-vector products |
| Expected log-likelihood | $O(|B|)$ | Scalar per point |
| Total per iteration | $O(|B|m^2 + m^3)$ | Independent of n! |
| Memory | $O(|B|m + m^2)$ | Truly scalable |
SVGP per-iteration cost is $O(|B|m^2 + m^3)$—completely independent of dataset size $n$! With $|B| = 1000$ and $m = 500$, each iteration takes the same time whether you have 10,000 or 10,000,000 training points. This enables GPs at unprecedented scale.
Prediction complexity:
Given a trained SVGP with variational parameters $(\mathbf{m}, \mathbf{S})$:
Mean prediction: $\mu_* = \mathbf{k}_*^T \mathbf{K}_{mm}^{-1} \mathbf{m}$ — $O(m)$ per test point once $\mathbf{K}_{mm}^{-1}\mathbf{m}$ is precomputed

Variance prediction: $\sigma_*^2 = k_{**} - \mathbf{k}_*^T \mathbf{K}_{mm}^{-1} \mathbf{k}_* + \mathbf{k}_*^T \mathbf{K}_{mm}^{-1} \mathbf{S}\, \mathbf{K}_{mm}^{-1} \mathbf{k}_*$ — $O(m^2)$ per test point

Compare to the exact GP: $O(n)$ for the mean, $O(n^2)$ for the variance. With $m \ll n$, predictions are dramatically faster.
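These two formulas translate directly into code. The sketch below (hypothetical names, same `kernel` convention as the earlier NumPy snippets) computes the latent predictive mean and variance from the learned $(\mathbf{Z}, \mathbf{m}, \mathbf{S})$.

```python
import numpy as np

def svgp_predict(X_test, Z, m_var, S, kernel, jitter=1e-6):
    """Latent predictive mean and variance at X_test from the learned (Z, m, S)."""
    m = Z.shape[0]
    K_mm = kernel(Z, Z) + jitter * np.eye(m)
    K_tm = kernel(X_test, Z)                        # n_test x m
    A = np.linalg.solve(K_mm, K_tm.T)               # columns are K_mm^{-1} k_*

    mean = K_tm @ np.linalg.solve(K_mm, m_var)      # k_*^T K_mm^{-1} m
    k_ss = np.diag(kernel(X_test, X_test))
    var = (k_ss
           - np.einsum('tm,mt->t', K_tm, A)         # - k_*^T K_mm^{-1} k_*
           + np.einsum('mt,mn,nt->t', A, S, A))     # + k_*^T K_mm^{-1} S K_mm^{-1} k_*
    return mean, var
```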
Sparse GPs connect deeply with other kernel approximation methods we've studied.
Sparse GP vs. Nyström:
Nyström approximates the kernel matrix: $\tilde{\mathbf{K}} = \mathbf{K}_{nm}\mathbf{K}_{mm}^{-1}\mathbf{K}_{mn}$.
VFE uses the same low-rank structure but adds a trace penalty that accounts for the variance discarded by the approximation, and it treats the inducing locations $\mathbf{Z}$ as free parameters optimized through the ELBO.
In fact, if we use fixed inducing points at training locations and ignore the trace term, VFE reduces to Nyström-based regression!
Sparse GP vs. RFF:
Both provide $m$-dimensional features, but with key differences:
| Aspect | Sparse GP (SVGP) | Nyström | RFF |
|---|---|---|---|
| Basis type | Inducing points (data-like) | Data points | Random Fourier features |
| Basis optimization | Yes (gradient-based) | Selection only | No (random) |
| Uncertainty | Full posterior variance | No uncertainty | No uncertainty |
| Probabilistic model | Complete Bayesian | Point estimate | Point estimate |
| Minibatch training | Yes (SVGP) | Possible but not standard | Yes (standard linear) |
| Kernel restriction | Any kernel | Any kernel | Shift-invariant only |
The unified view:
All three methods can be understood through the lens of feature-space approximation:
RFF: $\phi(\mathbf{x}) = \sqrt{\frac{2}{D}}\,[\cos(\boldsymbol{\omega}_j^T\mathbf{x} + b_j)]_{j=1}^D$ — random trigonometric features
Nyström: $\phi(\mathbf{x}) = \mathbf{K}_{mm}^{-1/2}\, [k(\mathbf{z}_j, \mathbf{x})]_{j=1}^m$ — kernel-based features centered at landmarks
Sparse GP: Same features as Nyström, but with learned $\mathbf{Z}$ and full posterior inference over weights
Sparse GPs add the Bayesian layer: instead of point-estimating the weights, they maintain distributions, enabling uncertainty quantification.
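The sketch below makes the unified view concrete by computing both feature maps for an RBF kernel (illustrative helper functions, not a library API). A sparse GP uses the Nyström-style features but additionally maintains a posterior over the weights.

```python
import numpy as np

def nystrom_features(X, Z, kernel, jitter=1e-6):
    """phi(x) = K_mm^{-1/2} k(Z, x); inner products reproduce Q(x, x')."""
    m = Z.shape[0]
    K_mm = kernel(Z, Z) + jitter * np.eye(m)
    w, V = np.linalg.eigh(K_mm)                     # symmetric inverse square root
    K_mm_inv_sqrt = (V / np.sqrt(w)) @ V.T
    return kernel(X, Z) @ K_mm_inv_sqrt             # n x m features

def rff_features(X, lengthscale=1.0, D=100, seed=0):
    """Random Fourier features for the RBF kernel (shift-invariant kernels only)."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(scale=1.0 / lengthscale, size=(d, D))   # omega_j ~ spectral density
    b = rng.uniform(0.0, 2.0 * np.pi, size=D)
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)             # n x D features
```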
Use SVGP when: You need uncertainty estimates, Bayesian model selection is important, or you're doing Bayesian optimization/active learning. Use Nyström when: You just need predictions, computational budget is limited, or interpretability of landmarks matters. Use RFF when: The kernel is shift-invariant, simplicity is key, or you need the fastest possible training.
Successfully deploying sparse GPs requires attention to several practical details.
1. Initialization matters:
Poor initialization can lead to slow convergence, inducing points stranded in low-density regions, or convergence to a poor local optimum of the ELBO.

Best practices: initialize $\mathbf{Z}$ with k-means centroids or a random subset of the training inputs, standardize the inputs so that length scales start in a sensible range, and monitor the ELBO; if inducing points drift into irrelevant regions, consider fixing them and optimizing only the variational and kernel parameters.
2. Hyperparameter optimization:
Sparse GPs still have kernel hyperparameters (length scales, variance, noise). Options: optimize them jointly with the variational parameters and inducing locations by maximizing the ELBO (the standard approach), or fix them first—for example from an exact GP fit to a subset of the data—and then learn only the variational parameters.
3. Beyond regression:
SVGP extends to classification and multi-output settings: non-Gaussian likelihoods (e.g., Bernoulli for binary classification) are handled by evaluating the expected log-likelihood term with quadrature or Monte Carlo sampling, and multi-output models reuse the same inducing point machinery for each latent function (a classification sketch follows below).
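A minimal sketch of the classification case in GPyTorch, assuming the `model`, `train_y`, and `test_x` objects from the earlier example; only the likelihood changes.

```python
import torch
import gpytorch

# Binary labels in {0, 1}: swap the Gaussian likelihood for a Bernoulli one.
likelihood = gpytorch.likelihoods.BernoulliLikelihood()
mll = gpytorch.mlls.VariationalELBO(likelihood, model, num_data=train_y.size(0))
# The training loop is unchanged; the expected log-likelihood no longer has a
# closed form, so the library evaluates it numerically.

# Prediction returns class probabilities rather than a Gaussian predictive.
model.eval(); likelihood.eval()
with torch.no_grad():
    probs = likelihood(model(test_x)).mean   # P(y = 1 | x) for each test point
```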
4. Software:
Mature SVGP implementations exist in GPyTorch (PyTorch, used in the example above), GPflow (TensorFlow), and GPy; the first two offer GPU acceleration out of the box.
Modern SVGP implementations with GPU acceleration can train on millions of points in minutes to hours. They are the method of choice for scalable uncertainty-aware regression and classification, used in production systems for drug discovery, climate modeling, and autonomous vehicles.
Sparse Gaussian Processes represent the most principled approach to scaling GPs while retaining their full Bayesian character—uncertainty quantification, principled model comparison, and coherent probabilistic predictions.
What's next:
We've now covered the three pillars of scalable kernel methods: Random Fourier Features, Nyström approximation, and Sparse GPs. In the final page of this module, we synthesize these techniques into a comprehensive scalability strategies guide—when to use which method, how to combine them, and how to approach real-world large-scale kernel learning problems.
You now understand Sparse Gaussian Processes from the inducing point framework through SVGP's minibatch training. This knowledge equips you to deploy uncertainty-aware machine learning at scales that were previously impossible with full GPs.