The sparse GP methods we have explored—DTC, FITC, and their variants—represent pragmatic approximations that reduce computational complexity. But they carry an uncomfortable property: they lack principled justification as inference procedures. We introduced approximations that seemed reasonable, verified they were computationally tractable, and hoped the resulting predictions would be sensible.
Variational Sparse Gaussian Processes transform this landscape. By recasting sparse GP inference within the framework of variational inference, we obtain:

- A well-defined approximation target: the variational distribution is chosen to minimize the KL divergence to the true posterior.
- A single objective, the ELBO, that lower-bounds the log marginal likelihood and can be used for hyperparameter learning.
- A built-in penalty (the trace correction) for variance the inducing points fail to capture, guarding against overconfidence.
- A path to stochastic minibatch optimization, scaling GPs to millions of data points.
This variational perspective—initiated by Titsias (2009) and refined in subsequent work—represents the modern paradigm for scalable Gaussian Processes. Understanding this framework is essential for any serious practitioner.
This page develops the variational sparse GP framework from first principles, derives the key ELBO objective, and demonstrates how this enables scalable inference on datasets previously considered impossible for GP methods.
By the end of this page, you will understand how variational inference applies to sparse GPs, derive the ELBO objective step by step, see how DTC emerges as an optimal variational distribution, appreciate the crucial trace term correction, and implement stochastic variational inference for GPs that scales to millions of data points.
Recall the general variational inference framework. Given data $\mathbf{y}$ and latent variables $\mathbf{z}$, we want the posterior $p(\mathbf{z}|\mathbf{y})$. When this is intractable, we approximate it with a tractable distribution $q(\mathbf{z})$ by minimizing the KL divergence:
$$\text{KL}(q(\mathbf{z}) \| p(\mathbf{z}|\mathbf{y})) = \int q(\mathbf{z}) \log \frac{q(\mathbf{z})}{p(\mathbf{z}|\mathbf{y})} d\mathbf{z}$$
This is equivalent to maximizing the Evidence Lower Bound (ELBO):
$$\mathcal{L}(q) = \log p(\mathbf{y}) - \text{KL}(q(\mathbf{z}) \| p(\mathbf{z}|\mathbf{y})) = \mathbb{E}_{q(\mathbf{z})}[\log p(\mathbf{y}, \mathbf{z})] - \mathbb{E}_{q(\mathbf{z})}[\log q(\mathbf{z})]$$
Since KL divergence is non-negative, $\mathcal{L}(q) \leq \log p(\mathbf{y})$ with equality when $q = p(\cdot|\mathbf{y})$.
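To make the bound concrete, here is a minimal numerical sketch (not from the original text) using a one-dimensional conjugate Gaussian model, where both the exact log evidence and the ELBO are available in closed form; the variable names are illustrative only.

```python
import numpy as np

# Toy model: z ~ N(0, 1), y | z ~ N(z, sigma2). Exact evidence: p(y) = N(0, 1 + sigma2).
sigma2, y = 0.5, 1.3
log_evidence = -0.5 * np.log(2 * np.pi * (1 + sigma2)) - y**2 / (2 * (1 + sigma2))

def elbo(mu_q, s2_q):
    """Closed-form ELBO for q(z) = N(mu_q, s2_q) under the toy model."""
    e_loglik = -0.5 * np.log(2 * np.pi * sigma2) - ((y - mu_q)**2 + s2_q) / (2 * sigma2)
    e_logprior = -0.5 * np.log(2 * np.pi) - (mu_q**2 + s2_q) / 2
    entropy = 0.5 * np.log(2 * np.pi * np.e * s2_q)
    return e_loglik + e_logprior + entropy

# Exact posterior: N(y / (1 + sigma2), sigma2 / (1 + sigma2)) -> the bound is tight.
print(elbo(y / (1 + sigma2), sigma2 / (1 + sigma2)), log_evidence)  # equal
print(elbo(0.0, 1.0) <= log_evidence)                               # True for any q
```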
Application to sparse GPs:

In the sparse GP setting, the observations are $\mathbf{y}$, and the latent variables are the function values $\mathbf{f}$ at the $N$ training inputs together with the inducing values $\mathbf{u} = f(Z)$ at $M$ inducing locations $Z$.
The joint prior from the GP is:
$$p(\mathbf{f}, \mathbf{u}) = p(\mathbf{f}|\mathbf{u})p(\mathbf{u})$$
where $p(\mathbf{u}) = \mathcal{N}(\mathbf{0}, K_{uu})$ and $p(\mathbf{f}|\mathbf{u}) = \mathcal{N}(K_{fu}K_{uu}^{-1}\mathbf{u}, K_{ff} - K_{fu}K_{uu}^{-1}K_{uf})$.
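The conditional $p(\mathbf{f}|\mathbf{u})$ is just the standard Gaussian conditioning formula. A minimal NumPy sketch, with illustrative names and kernel matrices assumed to be precomputed:

```python
import numpy as np

def gp_conditional(Kff, Kfu, Kuu, u, jitter=1e-8):
    """Moments of p(f | u) = N(Kfu Kuu^{-1} u, Kff - Kfu Kuu^{-1} Kuf)."""
    L = np.linalg.cholesky(Kuu + jitter * np.eye(len(Kuu)))
    A = np.linalg.solve(L, Kfu.T)            # A = L^{-1} Kuf, so A.T @ A = Kfu Kuu^{-1} Kuf
    mean = Kfu @ np.linalg.solve(L.T, np.linalg.solve(L, u))
    cov = Kff - A.T @ A
    return mean, cov
```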
The variational approach treats the inducing values $\mathbf{u}$ as the primary latent variables. Rather than approximating the $N \times N$ kernel matrix, we approximate the posterior $p(\mathbf{u}|\mathbf{y})$. Given the posterior on $\mathbf{u}$, we can reconstruct the posterior on $\mathbf{f}$ via the GP conditional. This shifts the complexity from $N$ to $M$.
Following Titsias (2009), we choose a variational distribution over the joint $(\mathbf{f}, \mathbf{u})$ with specific structure:
$$q(\mathbf{f}, \mathbf{u}) = p(\mathbf{f}|\mathbf{u})q(\mathbf{u})$$
Critically, we retain the exact GP conditional $p(\mathbf{f}|\mathbf{u})$ from the prior. The approximation affects only the marginal over inducing variables, where we use:
$$q(\mathbf{u}) = \mathcal{N}(\mathbf{m}, S)$$
with variational parameters $\mathbf{m} \in \mathbb{R}^M$ (mean) and $S \in \mathbb{R}^{M \times M}$ (covariance, constrained to be positive definite).
Why this structure?
The key insight is that under the GP prior, $p(\mathbf{f}|\mathbf{u})$ is already available in closed form. By keeping this conditional exact, we:

- never have to parameterize an $N$-dimensional distribution: the only free quantities are the $M$-dimensional mean $\mathbf{m}$ and the $M \times M$ covariance $S$;
- make the conditionals cancel in the objective, so the divergence term reduces to $\text{KL}(q(\mathbf{u}) \| p(\mathbf{u}))$ over the $M$-dimensional inducing variables (see the ELBO below);
- keep the approximate posterior consistent with the GP prior, so predictions remain valid GP predictions.
The marginal variational distribution over $\mathbf{f}$ becomes:
$$q(\mathbf{f}) = \int p(\mathbf{f}|\mathbf{u})q(\mathbf{u})d\mathbf{u} = \mathcal{N}(K_{fu}K_{uu}^{-1}\mathbf{m}, K_{ff} - K_{fu}K_{uu}^{-1}(K_{uu} - S)K_{uu}^{-1}K_{uf})$$
Note that when $S = K_{uu}$ (and $\mathbf{m} = \mathbf{0}$), $q(\mathbf{f})$ reduces to the prior with variance $K_{ff}$. When $S \to 0$, so that $q(\mathbf{u})$ collapses to a point estimate, the variance reduces to the conditional variance $K_{ff} - K_{fu}K_{uu}^{-1}K_{uf}$.
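A short NumPy sketch of these marginal moments, with the two limiting cases checked numerically (names are illustrative; kernel matrices are built from a toy one-dimensional RBF kernel):

```python
import numpy as np

def q_f_moments(Kff, Kfu, Kuu, m, S, jitter=1e-8):
    """Moments of q(f) = ∫ p(f|u) q(u) du for q(u) = N(m, S)."""
    Kuu_inv = np.linalg.inv(Kuu + jitter * np.eye(len(Kuu)))   # fine for a small sketch
    A = Kfu @ Kuu_inv                                           # N x M
    mean = A @ m
    cov = Kff - A @ (Kuu - S) @ A.T
    return mean, cov

# Sanity checks on random inputs
rng = np.random.default_rng(0)
X, Z = rng.normal(size=(6, 1)), rng.normal(size=(3, 1))
k = lambda A, B: np.exp(-0.5 * (A - B.T) ** 2)
Kff, Kfu, Kuu = k(X, X), k(X, Z), k(Z, Z)
_, cov_prior = q_f_moments(Kff, Kfu, Kuu, np.zeros(3), Kuu)     # S = Kuu -> prior covariance
_, cov_point = q_f_moments(Kff, Kfu, Kuu, np.zeros(3), 0 * Kuu) # S -> 0  -> Kff - Qff
print(np.allclose(cov_prior, Kff))                              # True
Qff = Kfu @ np.linalg.solve(Kuu, Kfu.T)
print(np.abs(cov_point - (Kff - Qff)).max())                    # ~0 (up to jitter)
```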
Predictive distribution:
For a new test point $\mathbf{x}_*$, the variational posterior predictive is:
$$q(f_*) = \int p(f_*|\mathbf{u})q(\mathbf{u})d\mathbf{u} = \mathcal{N}(\mu_*, \sigma_*^2)$$
where:
$$\mu_* = \mathbf{k}_{*u}K_{uu}^{-1}\mathbf{m}$$ $$\sigma_*^2 = k_{**} - \mathbf{k}_{*u}K_{uu}^{-1}(K_{uu} - S)K_{uu}^{-1}\mathbf{k}_{u*}$$
The mean involves only the M-dimensional inducing representation, making predictions O(M²) per test point. The variance correctly incorporates both prior uncertainty and posterior compression through $S$.
DTC can be viewed as replacing the conditional $p(\mathbf{f}|\mathbf{u})$ with a deterministic (zero-variance) conditional, while FITC retains only its diagonal. The variational framework's power is that it keeps $p(\mathbf{f}|\mathbf{u})$ exact and optimizes over $q(\mathbf{u})$ rather than fixing the approximation heuristically.
We now derive the Evidence Lower Bound for the variational sparse GP. The complete ELBO is:
$$\mathcal{L} = \mathbb{E}_{q(\mathbf{f},\mathbf{u})}[\log p(\mathbf{y}|\mathbf{f})] - \text{KL}(q(\mathbf{u}) \| p(\mathbf{u}))$$
Step 1: Expand using the variational structure
$$\mathcal{L} = \mathbb{E}_{q(\mathbf{u})}\left[\mathbb{E}_{p(\mathbf{f}|\mathbf{u})}[\log p(\mathbf{y}|\mathbf{f})]\right] - \text{KL}(q(\mathbf{u}) \| p(\mathbf{u}))$$
The inner expectation is over the exact GP conditional, and the outer is over the variational distribution on $\mathbf{u}$.
Step 2: Evaluate the expected log-likelihood
For Gaussian likelihood $p(\mathbf{y}|\mathbf{f}) = \mathcal{N}(\mathbf{y}; \mathbf{f}, \sigma_n^2 I)$:
$$\log p(\mathbf{y}|\mathbf{f}) = -\frac{N}{2}\log(2\pi\sigma_n^2) - \frac{1}{2\sigma_n^2}\|\mathbf{y} - \mathbf{f}\|^2$$
The expectation over $q(\mathbf{f})$ gives:
$$\mathbb{E}_{q(\mathbf{f})}[\log p(\mathbf{y}|\mathbf{f})] = -\frac{N}{2}\log(2\pi\sigma_n^2) - \frac{1}{2\sigma_n^2}\left(\|\mathbf{y} - \mathbb{E}_{q}[\mathbf{f}]\|^2 + \text{tr}(\text{Var}_q[\mathbf{f}])\right)$$
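This follows from the standard bias-variance identity (added here as an intermediate step), applied coordinate-wise:

$$\mathbb{E}_q[(y_n - f_n)^2] = (y_n - \mathbb{E}_q[f_n])^2 + \text{Var}_q[f_n]$$

Summing over $n$ yields the squared norm plus the trace of the marginal variance above.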
Step 3: Substitute the variational moments
With $\mathbb{E}_{q}[\mathbf{f}] = K_{fu}K_{uu}^{-1}\mathbf{m}$ and $$\text{Var}_q[\mathbf{f}] = K_{ff} - K_{fu}K_{uu}^{-1}(K_{uu} - S)K_{uu}^{-1}K_{uf}$$
The expected log-likelihood becomes:
$$\mathbb{E}_{q}[\log p(\mathbf{y}|\mathbf{f})] = \log \mathcal{N}(\mathbf{y}; K_{fu}K_{uu}^{-1}\mathbf{m}, \sigma_n^2 I) - \frac{1}{2\sigma_n^2}\text{tr}(\tilde{K})$$
where $\tilde{K} = K_{ff} - K_{fu}K_{uu}^{-1}(K_{uu} - S)K_{uu}^{-1}K_{uf} = K_{ff} - Q_{ff} + K_{fu}K_{uu}^{-1}SK_{uu}^{-1}K_{uf}$.
Step 4: The complete ELBO
Combining with the KL divergence between Gaussians:
$$\text{KL}(q(\mathbf{u}) \| p(\mathbf{u})) = \frac{1}{2}\left(\text{tr}(K_{uu}^{-1}S) + \mathbf{m}^\top K_{uu}^{-1}\mathbf{m} - M + \log\frac{|K_{uu}|}{|S|}\right)$$
The full ELBO is:
$$\mathcal{L} = \log \mathcal{N}(\mathbf{y}; K_{fu}K_{uu}^{-1}\mathbf{m}, \sigma_n^2 I) - \frac{\text{tr}(K_{ff} - Q_{ff})}{2\sigma_n^2} - \frac{\text{tr}(K_{fu}K_{uu}^{-1}SK_{uu}^{-1}K_{uf})}{2\sigma_n^2} - \text{KL}(q \| p)$$
The term $\text{tr}(K_{ff} - Q_{ff})/(2\sigma_n^2)$ is called the trace correction or approximation penalty. It measures how much variance the inducing points fail to capture and penalizes the ELBO accordingly. This is what distinguishes variational sparse GPs from DTC: DTC ignores this term, leading to overconfident predictions. The trace term provides principled regularization.
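A small NumPy sketch of the two penalty terms just derived, the Gaussian KL computed through Cholesky factors and the trace correction; helper names are illustrative and kernel matrices are assumed given.

```python
import numpy as np

def kl_gauss_to_prior(m, S, Kuu):
    """KL( N(m, S) || N(0, Kuu) ) using Cholesky factors for stability."""
    M = len(m)
    L = np.linalg.cholesky(Kuu)
    Ls = np.linalg.cholesky(S)
    alpha = np.linalg.solve(L, m)                 # L^{-1} m
    A = np.linalg.solve(L, Ls)                    # L^{-1} L_S
    trace = np.sum(A**2)                          # tr(Kuu^{-1} S)
    quad = alpha @ alpha                          # m^T Kuu^{-1} m
    logdet = 2 * np.sum(np.log(np.diag(L))) - 2 * np.sum(np.log(np.diag(Ls)))
    return 0.5 * (trace + quad - M + logdet)

def trace_correction(Kff_diag, Kfu, Kuu, noise_var):
    """tr(Kff - Qff) / (2 sigma_n^2): the variance the inducing points miss."""
    L = np.linalg.cholesky(Kuu)
    A = np.linalg.solve(L, Kfu.T)                 # L^{-1} Kuf
    Qff_diag = np.sum(A**2, axis=0)
    return np.sum(Kff_diag - Qff_diag) / (2 * noise_var)
```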
A remarkable property of the variational sparse GP ELBO is that the optimal $q(\mathbf{u})$ can be computed in closed form. Taking derivatives and setting to zero:
Optimal covariance:
$$S^* = (K_{uu}^{-1} + \sigma_n^{-2}K_{uu}^{-1}K_{uf}K_{fu}K_{uu}^{-1})^{-1}$$
Using the Woodbury identity:
$$S^* = K_{uu}(K_{uu} + \sigma_n^{-2}K_{uf}K_{fu})^{-1}K_{uu}$$
Optimal mean:
$$\mathbf{m}^* = \sigma_n^{-2}S^*K_{uu}^{-1}K_{uf}\mathbf{y}$$
Simplifying:
$$\mathbf{m}^* = \sigma_n^{-2}K_{uu}(K_{uu} + \sigma_n^{-2}K_{uf}K_{fu})^{-1}K_{uf}\mathbf{y}$$
Interpretation:
These optimal parameters have a clean interpretation. The posterior covariance $S^*$ is exactly what you would get by running GP inference on the inducing points themselves, treating $K_{fu}K_{uu}^{-1}\mathbf{u}$ as a linear observation model. The posterior mean $\mathbf{m}^*$ is the corresponding posterior mean under this model.
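The optimal parameters are only a few lines of linear algebra. A sketch under the same assumptions as the earlier snippets (precomputed kernel matrices, illustrative names):

```python
import numpy as np

def optimal_variational_params(Kuf, Kuu, y, noise_var):
    """Closed-form optimum of q(u) = N(m*, S*) for a Gaussian likelihood (Titsias, 2009)."""
    Sigma = Kuu + Kuf @ Kuf.T / noise_var          # Kuu + sigma_n^{-2} Kuf Kfu
    Sigma_inv = np.linalg.inv(Sigma)               # fine for a small M x M sketch
    S_star = Kuu @ Sigma_inv @ Kuu
    m_star = Kuu @ Sigma_inv @ (Kuf @ y) / noise_var
    return m_star, S_star
```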
Plugging these optimal values back into the ELBO gives the collapsed ELBO (also called the VFE bound after Titsias):
$$\mathcal{L}^* = \log \mathcal{N}(\mathbf{y}; \mathbf{0}, Q_{ff} + \sigma_n^2 I) - \frac{\text{tr}(K_{ff} - Q_{ff})}{2\sigma_n^2}$$
Computing the collapsed ELBO requires: (1) O(NM²) for matrix products involving K_{fu}, (2) O(M³) for Cholesky of the M×M matrix, (3) O(N) for the trace of K_{ff} (just the diagonal sum). The total is O(NM² + M³), matching classical sparse methods but with a principled objective.
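As a concrete illustration of that cost breakdown, here is a minimal NumPy sketch of the collapsed bound $\mathcal{L}^*$ that never forms an $N \times N$ matrix; names are illustrative and a Gaussian likelihood is assumed.

```python
import numpy as np

def collapsed_elbo(y, Kff_diag, Kfu, Kuu, noise_var, jitter=1e-8):
    """Titsias bound: log N(y; 0, Qff + sig2 I) - tr(Kff - Qff)/(2 sig2), in O(N M^2)."""
    N, M = Kfu.shape
    L = np.linalg.cholesky(Kuu + jitter * np.eye(M))
    A = np.linalg.solve(L, Kfu.T)                  # M x N, so Qff = A.T @ A
    B = np.eye(M) + A @ A.T / noise_var            # I + A A^T / sig2
    LB = np.linalg.cholesky(B)
    c = np.linalg.solve(LB, A @ y) / noise_var
    # log|Qff + sig2 I| via the matrix determinant lemma
    logdet = N * np.log(noise_var) + 2 * np.sum(np.log(np.diag(LB)))
    # y^T (Qff + sig2 I)^{-1} y via the Woodbury identity
    quad = y @ y / noise_var - c @ c
    log_marg = -0.5 * (N * np.log(2 * np.pi) + logdet + quad)
    trace = np.sum(Kff_diag - np.sum(A**2, axis=0)) / (2 * noise_var)
    return log_marg - trace
```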
The variational framework enables a crucial capability: stochastic optimization with minibatches. The collapsed bound couples all data points through the $N \times N$ matrix $Q_{ff} + \sigma_n^2 I$, but the uncollapsed ELBO, with $q(\mathbf{u}) = \mathcal{N}(\mathbf{m}, S)$ kept explicit (Hensman et al., 2013), decomposes as a sum over data points:
$$\mathcal{L} = \sum_{n=1}^N \mathcal{L}_n - \text{KL}(q(\mathbf{u}) \| p(\mathbf{u}))$$
where $\mathcal{L}_n$ depends only on the n-th datapoint $(\mathbf{x}_n, y_n)$ and the global parameters $(\mathbf{m}, S, Z, \theta)$.
This additive structure allows us to estimate the ELBO using random minibatches:
$$\mathcal{L} \approx \frac{N}{B}\sum_{n \in \mathcal{B}} \mathcal{L}_n - \text{KL}(q(\mathbf{u}) \| p(\mathbf{u}))$$
where $\mathcal{B}$ is a minibatch of $B \ll N$ datapoints.
Complexity per iteration:
With minibatches, each gradient step requires:

- $O(BM^2)$ for the cross-covariance terms $K_{bu}$ and the variational moments on the batch,
- $O(M^3)$ for the Cholesky factorization of $K_{uu}$ and the KL term,
- $O(BM)$ for the per-point expected log-likelihoods.

For B = 256 and M = 1000, the batch term is roughly $2.6 \times 10^8$ operations and the $M^3$ Cholesky adds about $10^9$ more, both tractable on a single GPU in milliseconds. Compare this to the exact GP's $O(N^3)$: for N = 1,000,000, exact inference is completely intractable, while stochastic sparse GP training requires only hours of computation.
The SVGP algorithm:
```
Algorithm: Stochastic Variational Gaussian Process (SVGP)

Input:  Data {(x_n, y_n)}, inducing points Z, batch size B, learning rate η
Output: Variational parameters m, S; hyperparameters θ

1. Initialize m = 0, S = K_uu, θ to reasonable values
2. For t = 1 to T:
   a. Sample minibatch B ⊂ {1, ..., N}
   b. Compute stochastic ELBO gradient:
      ∇ ≈ (N/B) Σ_{n∈B} ∇L_n - ∇KL(q||p)
   c. Update parameters via gradient ascent:
      (m, S, Z, θ) ← (m, S, Z, θ) + η · ∇
3. Return optimal parameters
```
Natural gradients:
For the variational posterior $q(\mathbf{u})$, natural gradients can dramatically accelerate convergence. The natural gradient pre-conditions the Euclidean gradient by the inverse Fisher information matrix, which for Gaussians has closed form. Natural gradients for $(\mathbf{m}, S)$ often converge in 10× fewer iterations.
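As a brief aside (added for concreteness, and relying on the standard exponential-family result rather than anything specific to this page): for a Gaussian $q(\mathbf{u})$, the natural gradient of the ELBO with respect to the natural parameters equals the ordinary gradient with respect to the expectation parameters, so a natural-gradient step with step size $\rho$ can be written without any explicit Fisher matrix inversion:

$$\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} + \rho \,\frac{\partial \mathcal{L}}{\partial \boldsymbol{\eta}}, \qquad \boldsymbol{\theta} = \left(S^{-1}\mathbf{m},\ -\tfrac{1}{2}S^{-1}\right), \qquad \boldsymbol{\eta} = \left(\mathbf{m},\ S + \mathbf{m}\mathbf{m}^\top\right)$$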
With SVGP, Gaussian Processes become applicable to datasets with millions of points. Training proceeds analogously to neural network training: shuffle data, iterate through minibatches, update parameters via gradient descent. A well-tuned SVGP on 1M points with M=1000 inducing points can train in under an hour on a modern GPU.
```python
import numpy as np
import torch
import torch.nn as nn
from torch.optim import Adam
from scipy.cluster.vq import kmeans2


class SVGP(nn.Module):
    """
    Stochastic Variational Gaussian Process (SVGP).

    Implements the Titsias (2009) / Hensman et al. (2013) variational
    framework with minibatch training for scalability to large datasets.
    """

    def __init__(self, X_train: np.ndarray, M: int = 100, jitter: float = 1e-4):
        super().__init__()
        N, d = X_train.shape
        self.N = N
        self.M = M
        self.d = d
        self.jitter = jitter

        # Initialize inducing points via k-means
        Z_init, _ = kmeans2(X_train, M, minit='++')

        # Learnable inducing point locations
        self.Z = nn.Parameter(torch.tensor(Z_init, dtype=torch.float64))

        # Variational parameters
        # Mean: m ∈ R^M
        self.m = nn.Parameter(torch.zeros(M, dtype=torch.float64))
        # Covariance: S = L_S @ L_S^T (Cholesky parameterization ensures PSD)
        self.L_S_raw = nn.Parameter(torch.eye(M, dtype=torch.float64))

        # Kernel hyperparameters (in log space for positivity)
        self.log_lengthscale = nn.Parameter(torch.zeros(d, dtype=torch.float64))
        self.log_variance = nn.Parameter(torch.tensor(0.0, dtype=torch.float64))
        self.log_noise = nn.Parameter(torch.tensor(-2.0, dtype=torch.float64))

    @property
    def S(self):
        """Covariance matrix from Cholesky factor."""
        L = torch.tril(self.L_S_raw)
        return L @ L.T

    def kernel(self, X1, X2):
        """ARD squared exponential kernel."""
        lengthscales = torch.exp(self.log_lengthscale)
        variance = torch.exp(self.log_variance)
        # Scaled squared distances
        X1_scaled = X1 / lengthscales
        X2_scaled = X2 / lengthscales
        dist_sq = (torch.sum(X1_scaled**2, dim=1, keepdim=True)
                   + torch.sum(X2_scaled**2, dim=1)
                   - 2 * X1_scaled @ X2_scaled.T)
        return variance * torch.exp(-0.5 * dist_sq)

    def elbo(self, X_batch: torch.Tensor, y_batch: torch.Tensor) -> torch.Tensor:
        """
        Compute the stochastic ELBO estimate for a minibatch.

        Args:
            X_batch: (B, d) minibatch inputs
            y_batch: (B,) minibatch targets

        Returns:
            Stochastic ELBO estimate
        """
        B = len(X_batch)
        M = self.M
        noise_var = torch.exp(self.log_noise)

        # Kernel matrices
        Kuu = self.kernel(self.Z, self.Z) + self.jitter * torch.eye(M, dtype=torch.float64)
        Kbu = self.kernel(X_batch, self.Z)  # B x M
        Kbb_diag = torch.exp(self.log_variance) * torch.ones(B, dtype=torch.float64)  # diagonal only

        # Cholesky of Kuu
        L_uu = torch.linalg.cholesky(Kuu)

        # alpha = L_uu^{-1} K_ub (M x B); diag(Q_bb) = column sums of alpha^2
        alpha = torch.linalg.solve_triangular(L_uu, Kbu.T, upper=False)
        Qbb_diag = torch.sum(alpha**2, dim=0)

        # Variational mean at batch points: mu_b = K_bu K_uu^{-1} m
        Kuu_inv_m = torch.cholesky_solve(self.m.unsqueeze(-1), L_uu).squeeze(-1)
        mu_batch = Kbu @ Kuu_inv_m

        # Variational variance at batch points:
        #   Var_q[f_n] = k_nn - Q_nn + k_nu K_uu^{-1} S K_uu^{-1} k_un
        # With C = K_uu^{-1} K_ub and S = L_S L_S^T, the S-term is the column sums of (L_S^T C)^2
        C = torch.cholesky_solve(Kbu.T, L_uu)  # M x B
        L_S = torch.tril(self.L_S_raw)
        var_S_part = torch.sum((L_S.T @ C)**2, dim=0)
        var_batch = Kbb_diag - Qbb_diag + var_S_part

        # Expected log-likelihood per point:
        #   E_q[log p(y_n|f_n)] = log N(y_n; mu_n, sigma^2) - Var_q[f_n] / (2 sigma^2)
        # Note: the trace correction tr(K_bb - Q_bb) is already contained in var_batch,
        # so it must not be subtracted a second time.
        sq_error = (y_batch - mu_batch)**2
        ell = (-0.5 * torch.log(2 * np.pi * noise_var)
               - sq_error / (2 * noise_var)
               - var_batch / (2 * noise_var))

        # Scale by N/B for an unbiased stochastic estimate of the full-data sum
        expected_ll = (self.N / B) * torch.sum(ell)

        # KL divergence KL(q(u) || p(u))
        #   = 0.5 * (tr(K_uu^{-1} S) + m^T K_uu^{-1} m - M + log|K_uu| - log|S|)
        kl_trace = torch.sum(torch.linalg.solve_triangular(L_uu, L_S, upper=False)**2)
        kl_quad = torch.dot(Kuu_inv_m, self.m)
        kl_logdet = (2 * torch.sum(torch.log(torch.diag(L_uu)))
                     - 2 * torch.sum(torch.log(torch.abs(torch.diag(L_S)))))
        kl = 0.5 * (kl_trace + kl_quad - M + kl_logdet)

        # ELBO = E_q[log p(y|f)] - KL
        return expected_ll - kl

    def predict(self, X_star: torch.Tensor):
        """Predictive mean and variance at test points."""
        with torch.no_grad():
            Ksu = self.kernel(X_star, self.Z)
            Kss_diag = torch.exp(self.log_variance) * torch.ones(len(X_star), dtype=torch.float64)
            Kuu = self.kernel(self.Z, self.Z) + self.jitter * torch.eye(self.M, dtype=torch.float64)
            L_uu = torch.linalg.cholesky(Kuu)

            # Mean: K_*u K_uu^{-1} m
            Kuu_inv_m = torch.cholesky_solve(self.m.unsqueeze(-1), L_uu).squeeze(-1)
            mu = Ksu @ Kuu_inv_m

            # Variance (incorporating S)
            alpha = torch.linalg.solve_triangular(L_uu, Ksu.T, upper=False)
            Qss_diag = torch.sum(alpha**2, dim=0)
            C = torch.cholesky_solve(Ksu.T, L_uu)
            L_S = torch.tril(self.L_S_raw)
            var_S = torch.sum((L_S.T @ C)**2, dim=0)
            var = Kss_diag - Qss_diag + var_S

            return mu, torch.clamp(var, min=1e-6)


def train_svgp(X, y, M=100, batch_size=256, n_epochs=50, lr=0.01):
    """Train SVGP with minibatch optimization."""
    model = SVGP(X, M=M)
    X_tensor = torch.tensor(X, dtype=torch.float64)
    y_tensor = torch.tensor(y, dtype=torch.float64)

    optimizer = Adam(model.parameters(), lr=lr)
    N = len(X)

    print(f"Training SVGP: N={N}, M={M}, batch_size={batch_size}")
    print("=" * 50)

    for epoch in range(n_epochs):
        # Shuffle data
        perm = torch.randperm(N)
        epoch_elbo = 0.0
        n_batches = 0

        for i in range(0, N, batch_size):
            idx = perm[i:i + batch_size]
            X_batch = X_tensor[idx]
            y_batch = y_tensor[idx]

            optimizer.zero_grad()
            elbo = model.elbo(X_batch, y_batch)
            (-elbo).backward()  # Minimize negative ELBO
            optimizer.step()

            epoch_elbo += elbo.item()
            n_batches += 1

        avg_elbo = epoch_elbo / n_batches
        if (epoch + 1) % 10 == 0:
            print(f"Epoch {epoch+1}: Avg ELBO = {avg_elbo:.2f}")

    return model


# Example usage
if __name__ == "__main__":
    np.random.seed(42)
    torch.manual_seed(42)

    # Generate larger dataset to demonstrate scalability
    N = 10000
    X = np.random.randn(N, 2) * 3
    y = np.sin(X[:, 0]) + 0.5 * np.cos(0.5 * X[:, 1]) + 0.1 * np.random.randn(N)

    model = train_svgp(X, y, M=100, batch_size=256, n_epochs=30)

    # Make predictions
    X_test = np.random.randn(100, 2)
    mu, var = model.predict(torch.tensor(X_test, dtype=torch.float64))
    print(f"\nPredictions: mean in [{mu.min():.2f}, {mu.max():.2f}]")
    print(f"Uncertainty: std in [{var.sqrt().min():.3f}, {var.sqrt().max():.3f}]")
```

The variational sparse GP framework has been extended in numerous directions since its introduction. Key innovations include:
1. Inter-domain Variational GPs (Lázaro-Gredilla & Figueiras-Vidal, 2009):
Inducing points need not be in the same domain as training points. "Inter-domain" inducing variables can represent integrals, derivatives, or Fourier coefficients of the function. This enables more expressive representations with fewer inducing points.
2. Deep Gaussian Processes (Damianou & Lawrence, 2013):
Stack multiple GP layers, where the output of one GP becomes the input to the next. Variational sparse GPs make deep GPs computationally tractable:
$$\mathbf{f}^{(1)} \sim \mathcal{GP}(0, k^{(1)}(\mathbf{x}, \mathbf{x}'))$$ $$\mathbf{f}^{(2)} \sim \mathcal{GP}(0, k^{(2)}(\mathbf{f}^{(1)}, \mathbf{f}^{(1)'}))$$
Each layer uses its own inducing points and variational posterior.
3. Multi-output GPs (Alvarez et al., 2010):
Extend to vector-valued outputs with correlated components. Sparse approximations make multi-output GPs practical for multi-task and multi-fidelity learning.
4. Non-Gaussian likelihoods:
The variational framework extends beyond Gaussian noise to classification, count data, and other likelihoods. The expected log-likelihood no longer has a closed form and must itself be approximated, typically with Gauss-Hermite quadrature or Monte Carlo samples from the Gaussian marginals $q(f_n)$; a quadrature sketch follows below.
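A minimal sketch (not part of the original text) of one-dimensional Gauss-Hermite quadrature for $\mathbb{E}_{q(f_n)}[\log p(y_n | f_n)]$, here with a Bernoulli-probit likelihood chosen purely as an example:

```python
import numpy as np
from scipy.stats import norm

def expected_log_lik_gh(mu, var, y, n_points=20):
    """E_{N(f; mu, var)}[log p(y | f)] for a probit Bernoulli likelihood, via Gauss-Hermite."""
    x, w = np.polynomial.hermite.hermgauss(n_points)   # nodes/weights for ∫ e^{-x^2} g(x) dx
    f = mu + np.sqrt(2.0 * var) * x                     # change of variables to N(mu, var)
    log_lik = norm.logcdf(y * f)                        # y ∈ {-1, +1}
    return np.sum(w * log_lik) / np.sqrt(np.pi)

# Example: a single latent marginal q(f_n) = N(0.3, 0.5) with label y_n = +1
print(expected_log_lik_gh(0.3, 0.5, +1))
```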
5. Orthogonally Decoupled Variational GPs (Salimbeni et al., 2018):
Parameterize the variational posterior more flexibly using orthogonal bases. This improves expressiveness while maintaining tractability.
6. Sparse Variational GPs with Streaming Data (Bui et al., 2017):
Extend to online settings where data arrives sequentially. The variational posterior is updated incrementally without storing all historical data.
Modern GP libraries implement variational sparse GPs as their primary scalable method: GPyTorch (PyTorch), GPflow (TensorFlow), and Pyro (PyTorch probabilistic programming) all provide production-quality SVGP implementations with GPU support, natural gradients, and automatic differentiation for hyperparameter learning.
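For orientation, here is a compact sketch of how an SVGP model is typically declared in GPyTorch. It follows GPyTorch's documented variational API as I understand it (a recent library version is assumed); it is not code from this page, so verify the class names against the library docs before relying on it.

```python
import torch
import gpytorch

class SVGPModel(gpytorch.models.ApproximateGP):
    def __init__(self, inducing_points):
        # q(u) = N(m, S) with a full (Cholesky-parameterized) covariance
        q_u = gpytorch.variational.CholeskyVariationalDistribution(inducing_points.size(0))
        # The variational strategy handles p(f|u) and the KL term internally
        strategy = gpytorch.variational.VariationalStrategy(
            self, inducing_points, q_u, learn_inducing_locations=True
        )
        super().__init__(strategy)
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel())

    def forward(self, x):
        return gpytorch.distributions.MultivariateNormal(self.mean_module(x), self.covar_module(x))

# Training objective: the (uncollapsed) ELBO, scaled for minibatches via num_data
# likelihood = gpytorch.likelihoods.GaussianLikelihood()
# mll = gpytorch.mlls.VariationalELBO(likelihood, model, num_data=N)
```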
Deploying stochastic variational GPs in practice requires attention to several details that can significantly impact performance:
| Parameter | Typical Range | Guidance |
|---|---|---|
| Inducing points (M) | 100-2000 | Start with √N; increase if ELBO plateaus with more iterations |
| Batch size (B) | 64-1024 | Larger is more stable; smaller allows faster iterations |
| Learning rate | 0.001-0.1 | Use Adam; reduce if unstable, increase if slow |
| Epochs | 50-500 | Monitor validation ELBO for convergence |
| Natural gradients | On/Off | Enable for m, S learning; often 2-10× faster |
SVGP can overfit the ELBO if trained too long, especially with learned inducing points. The noise variance σ² may shrink to near-zero, the trace term may dominate, and predictions become overconfident. Use validation-based early stopping and regularize the noise variance lower bound.
We have developed the variational sparse GP framework, the foundation of modern scalable Gaussian Process methods. The key insights from this page:

- The variational family $q(\mathbf{f}, \mathbf{u}) = p(\mathbf{f}|\mathbf{u})q(\mathbf{u})$ keeps the GP conditional exact and approximates only the $M$-dimensional distribution over inducing values.
- The ELBO lower-bounds the log marginal likelihood and includes a trace correction that penalizes variance the inducing points fail to capture.
- For Gaussian likelihoods the optimal $q(\mathbf{u})$ is available in closed form, yielding the collapsed Titsias bound at $O(NM^2 + M^3)$ cost.
- The uncollapsed ELBO decomposes over data points, enabling stochastic minibatch training (SVGP) that scales to millions of observations.
What's next:
The final page in this module explores an orthogonal approach to GP scalability: random feature approximations. Rather than compressing through inducing points, random features approximate the kernel itself via finite-dimensional feature maps. This provides different trade-offs and is particularly effective for certain kernel families. Together with variational sparse GPs, these methods cover the full spectrum of scalable GP techniques.
You now understand the variational sparse GP framework that powers modern scalable GP implementations. The ELBO objective, stochastic optimization, and principled uncertainty propagation enable Gaussian Processes at scales that were previously unthinkable—from domains like computer vision and natural language processing to industrial applications with millions of observations.