Both Laplace approximation and Expectation Propagation find Gaussian approximations to the classification posterior, but through fundamentally different mechanisms—mode-finding and moment matching, respectively. Variational Inference (VI) offers a third paradigm: casting approximate inference as an optimization problem.
The variational approach seeks the distribution $q$ from a tractable family that minimizes a divergence from the true posterior. This optimization perspective has profound practical advantages: it naturally interfaces with modern optimization tools (stochastic gradient descent, automatic differentiation) and provides a principled framework for sparse approximations that scale GP classification to large datasets.
In this page, we develop variational GP classification from first principles, arriving at methods that power modern GP libraries like GPflow and GPyTorch.
By the end of this page, you will understand: (1) The variational inference framework and the Evidence Lower Bound (ELBO), (2) How to derive the ELBO for GP classification, (3) The reparameterization trick for gradient estimation, (4) Sparse variational GP classification with inducing points, (5) Stochastic optimization for large-scale problems, and (6) Connections to variational autoencoders and modern deep learning.
Variational inference reformulates posterior approximation as optimization. We seek a distribution $q(\mathbf{f})$ from a tractable family $\mathcal{Q}$ that is 'closest' to the true posterior.
The KL Divergence Objective:
We want to minimize:
$$\text{KL}(q(\mathbf{f}) \,\|\, p(\mathbf{f} | \mathbf{y}, X)) = \int q(\mathbf{f}) \log \frac{q(\mathbf{f})}{p(\mathbf{f} | \mathbf{y}, X)} \, d\mathbf{f}$$
But this requires the intractable posterior! The variational trick is to rearrange this into a computable objective.
Deriving the ELBO:
Using $p(\mathbf{f} | \mathbf{y}, X) = p(\mathbf{y} | \mathbf{f}) p(\mathbf{f} | X) / p(\mathbf{y} | X)$:
$$\text{KL}(q \,\|\, p) = \log p(\mathbf{y} | X) - \left[\mathbb{E}_q[\log p(\mathbf{y} | \mathbf{f})] - \text{KL}(q(\mathbf{f}) \,\|\, p(\mathbf{f} | X))\right]$$
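In more detail, expanding the log of Bayes' rule inside the KL and noting that $\log p(\mathbf{y} | X)$ does not depend on $\mathbf{f}$:

$$\text{KL}(q \,\|\, p) = \mathbb{E}_q[\log q(\mathbf{f})] - \mathbb{E}_q[\log p(\mathbf{y} | \mathbf{f})] - \mathbb{E}_q[\log p(\mathbf{f} | X)] + \log p(\mathbf{y} | X)$$

Grouping $\mathbb{E}_q[\log q(\mathbf{f})] - \mathbb{E}_q[\log p(\mathbf{f} | X)] = \text{KL}(q(\mathbf{f}) \,\|\, p(\mathbf{f} | X))$ gives the identity above.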
Since KL ≥ 0:
$$\log p(\mathbf{y} | X) \geq \underbrace{\mathbb{E}_q[\log p(\mathbf{y} | \mathbf{f})] - \text{KL}(q(\mathbf{f}) \,\|\, p(\mathbf{f} | X))}_{\text{ELBO}}$$
The ELBO (Evidence Lower BOund) lower-bounds the log marginal likelihood (evidence). Maximizing the ELBO simultaneously: (1) Finds q close to the posterior (minimizes KL), and (2) Provides an approximation to the marginal likelihood for hyperparameter learning. The gap between ELBO and log-evidence equals the KL divergence.
The ELBO Decomposition:
$$\mathcal{L}_{\text{ELBO}} = \underbrace{\mathbb{E}_q[\log p(\mathbf{y} | \mathbf{f})]}_{\text{Expected log-likelihood}} - \underbrace{\text{KL}(q(\mathbf{f}) \,\|\, p(\mathbf{f} | X))}_{\text{Complexity penalty}}$$
This decomposition has an intuitive interpretation:
Expected log-likelihood: Encourages $q$ to place mass where the data is likely. This term measures how well the approximate posterior explains the observations.
KL regularizer: Penalizes $q$ for deviating from the prior. This prevents overfitting and encodes our prior beliefs about function smoothness.
Maximizing the ELBO balances data fit against prior fidelity—exactly the Bayesian trade-off.
The choice of variational family $\mathcal{Q}$ determines both the quality of approximation and computational tractability. For GP classification, we use Gaussian variational distributions.
Full Gaussian Variational Posterior:
$$q(\mathbf{f}) = \mathcal{N}(\mathbf{f} | \boldsymbol{\mu}, \Sigma)$$
where $\boldsymbol{\mu} \in \mathbb{R}^n$ and $\Sigma \in \mathbb{R}^{n \times n}$ (symmetric positive definite) are variational parameters.
Parameter Count:
A full Gaussian requires $n$ mean parameters plus $n(n+1)/2$ free covariance parameters, i.e., $O(n^2)$ in total. This is prohibitive for large $n$. Common restrictions include a diagonal (mean-field) covariance, a low-rank-plus-diagonal covariance, or a structured Cholesky factor $\Sigma = LL^\top$.
Computing the KL Term:
For the GP prior $p(\mathbf{f} | X) = \mathcal{N}(\mathbf{f} | \mathbf{m}, K)$ and variational posterior $q(\mathbf{f}) = \mathcal{N}(\mathbf{f} | \boldsymbol{\mu}, \Sigma)$:
$$\text{KL}(q \,\|\, p) = \frac{1}{2}\left[\text{tr}(K^{-1}\Sigma) + (\boldsymbol{\mu} - \mathbf{m})^\top K^{-1}(\boldsymbol{\mu} - \mathbf{m}) - n + \log|K| - \log|\Sigma|\right]$$
This is the KL divergence between two Gaussians and has the closed form above.
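As a concrete reference, this expression can be evaluated stably via Cholesky factors. A minimal PyTorch sketch (the function name and use of `torch.cholesky_solve` are illustrative choices, not prescribed by the text):

```python
import torch

def gaussian_kl(mu, Sigma, m, K):
    """KL( N(mu, Sigma) || N(m, K) ) for dense covariance matrices."""
    n = mu.shape[0]
    L_K = torch.linalg.cholesky(K)
    L_S = torch.linalg.cholesky(Sigma)
    # tr(K^{-1} Sigma) via a Cholesky solve
    trace_term = torch.cholesky_solve(Sigma, L_K).diagonal().sum()
    # (mu - m)^T K^{-1} (mu - m)
    diff = (mu - m).unsqueeze(-1)
    quad_term = (diff * torch.cholesky_solve(diff, L_K)).sum()
    # log|K| - log|Sigma| from the Cholesky diagonals
    logdet_term = 2 * (torch.log(L_K.diagonal()).sum() - torch.log(L_S.diagonal()).sum())
    return 0.5 * (trace_term + quad_term - n + logdet_term)
```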
Computing the Expected Log-Likelihood:
For classification with independent likelihoods:
$$\mathbb{E}_q[\log p(\mathbf{y} | \mathbf{f})] = \sum_{i=1}^n \mathbb{E}_{q(f_i)}[\log p(y_i | f_i)]$$
Each term is a one-dimensional integral:
$$\mathbb{E}_{q(f_i)}[\log p(y_i | f_i)] = \int \log p(y_i | f_i) \cdot \mathcal{N}(f_i | \mu_i, \sigma_i^2) \, df_i$$
For logistic likelihood, this integral is intractable but can be approximated via Gauss-Hermite quadrature or Monte Carlo sampling.
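For instance, a minimal Gauss-Hermite sketch for the logistic case (the function name and 20-point rule are illustrative; `scipy.special.log_expit` is assumed available as a stable log-sigmoid):

```python
import numpy as np
from scipy.special import log_expit  # numerically stable log(sigmoid(x))

def expected_log_lik_gh(y, mu, sigma, n_points=20):
    """Approximate E_{N(f | mu, sigma^2)}[ log p(y | f) ] for a logistic likelihood, y in {0, 1}."""
    # Nodes/weights for integrals of the form  int exp(-x^2) g(x) dx
    x, w = np.polynomial.hermite.hermgauss(n_points)
    f = mu + np.sqrt(2.0) * sigma * x                          # change of variables
    log_lik = np.where(y == 1, log_expit(f), log_expit(-f))    # log p(y | f)
    return np.sum(w * log_lik) / np.sqrt(np.pi)
```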
The 'mean-field' approximation uses diagonal Σ, assuming posterior independence between latent values. This is computationally cheap but ignores correlations. For GP classification, correlations matter! Full or low-rank covariance is often worth the extra cost, or use inducing point methods that naturally capture correlations.
Maximizing the ELBO with respect to variational parameters requires computing gradients. The KL term has a closed-form gradient. The expected log-likelihood is the challenge.
The Monte Carlo Gradient Estimator:
A naive approach: sample $f^{(s)} \sim q(\mathbf{f})$ and estimate:
$$\nabla_\phi \mathbb{E}_q[\log p(\mathbf{y} | \mathbf{f})] \approx \frac{1}{S}\sum_{s=1}^S \log p(\mathbf{y} | f^{(s)}) \nabla_\phi \log q(f^{(s)})$$
This is the score function estimator (REINFORCE). It's unbiased but has high variance.
The Reparameterization Trick:
A lower-variance alternative: express samples as a deterministic transformation of a parameter-free random variable.
For $q(\mathbf{f}) = \mathcal{N}(\boldsymbol{\mu}, \Sigma)$ with $\Sigma = LL^\top$ (Cholesky):
$$\mathbf{f} = \boldsymbol{\mu} + L\boldsymbol{\epsilon}, \quad \boldsymbol{\epsilon} \sim \mathcal{N}(0, I)$$
Now the gradient:
$$\nabla_\phi \mathbb{E}_q[\log p(\mathbf{y} | \mathbf{f})] = \mathbb{E}_{\boldsymbol{\epsilon}}\left[\nabla_\phi \log p(\mathbf{y} | \boldsymbol{\mu} + L\boldsymbol{\epsilon})\right]$$
where the gradient can be computed via automatic differentiation through the log-likelihood.
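A minimal one-dimensional illustration of the trick (a toy setup for exposition only, not part of the original text):

```python
import torch

mu = torch.tensor(0.5, requires_grad=True)           # variational mean
log_sigma = torch.tensor(-1.0, requires_grad=True)   # log of variational std

eps = torch.randn(1000)                               # parameter-free noise
f = mu + torch.exp(log_sigma) * eps                   # reparameterized samples f ~ N(mu, sigma^2)

# Monte Carlo estimate of E_q[log sigmoid(f)], i.e. the expected log-likelihood for y = 1
obj = torch.nn.functional.logsigmoid(f).mean()
obj.backward()                                        # gradients flow through mu and log_sigma
print(mu.grad, log_sigma.grad)
```

The full implementation below puts the whitened parameterization, the closed-form KL, and the reparameterized expected log-likelihood together.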
```python
import numpy as np
import torch
import torch.nn as nn
from torch.distributions import Normal, kl_divergence


class VariationalGPC(nn.Module):
    """
    Variational Gaussian Process Classification.

    Uses the reparameterization trick for gradient estimation.
    """

    def __init__(self, n_train, kernel_fn, X_train, y_train):
        super().__init__()
        self.n = n_train
        self.X = X_train
        self.y = y_train

        # Precompute kernel matrix
        self.K = torch.tensor(kernel_fn(X_train, X_train), dtype=torch.float32)
        self.K_chol = torch.linalg.cholesky(self.K + 1e-6 * torch.eye(self.n))

        # Variational parameters (using whitened parameterization)
        # q(f) = N(μ, Σ) where μ = L_K @ m_white, Σ = L_K @ S_white @ L_K^T
        self.m_white = nn.Parameter(torch.zeros(n_train))
        self.L_white_diag = nn.Parameter(torch.zeros(n_train))  # log of diagonal

    def get_variational_distribution(self):
        """Compute variational mean and covariance factor."""
        # Diagonal covariance in whitened space
        L_white = torch.diag(torch.exp(self.L_white_diag))

        # Transform to original space
        mu = self.K_chol @ self.m_white
        # Σ = L_K @ S_white @ L_K^T where S_white = L_white @ L_white^T
        L_cov = self.K_chol @ L_white

        return mu, L_cov

    def elbo(self, n_samples=10):
        """
        Compute the ELBO using the reparameterization trick.

        ELBO = E_q[log p(y|f)] - KL(q(f) || p(f))
        """
        mu, L_cov = self.get_variational_distribution()

        # KL divergence: KL(N(μ, Σ) || N(0, K))
        # In the whitened parameterization, this simplifies to:
        # KL = 0.5 * (tr(S_white) + m_white^T m_white - n - log|S_white|)
        S_white_diag = torch.exp(2 * self.L_white_diag)
        kl = 0.5 * (S_white_diag.sum() + (self.m_white**2).sum()
                    - self.n - 2 * self.L_white_diag.sum())

        # Expected log-likelihood via Monte Carlo
        epsilon = torch.randn(n_samples, self.n)
        f_samples = mu.unsqueeze(0) + epsilon @ L_cov.T  # (n_samples, n)

        # Logistic log-likelihood
        y_expanded = self.y.unsqueeze(0).expand(n_samples, -1)
        log_likelihood = (y_expanded * torch.log(torch.sigmoid(f_samples)) +
                          (1 - y_expanded) * torch.log(1 - torch.sigmoid(f_samples)))

        expected_ll = log_likelihood.sum(dim=1).mean()  # Average over samples

        elbo = expected_ll - kl
        return elbo

    def fit(self, n_iter=500, lr=0.1):
        """Optimize the ELBO."""
        optimizer = torch.optim.Adam(self.parameters(), lr=lr)

        for i in range(n_iter):
            optimizer.zero_grad()
            loss = -self.elbo(n_samples=10)  # Negative ELBO for minimization
            loss.backward()
            optimizer.step()

            if (i + 1) % 100 == 0:
                print(f"Iter {i+1}: ELBO = {-loss.item():.4f}")

        return self

    def predict(self, X_test, kernel_fn, n_samples=100):
        """Predict class probabilities at test points."""
        with torch.no_grad():
            mu, L_cov = self.get_variational_distribution()

            # Test kernel vectors
            K_star = torch.tensor(kernel_fn(X_test, self.X.numpy()), dtype=torch.float32)
            K_star_star = torch.tensor(
                np.diag(kernel_fn(X_test, X_test)), dtype=torch.float32
            )

            # Predictive distribution
            # μ* = K_*^T K^{-1} μ
            alpha = torch.linalg.solve(self.K, mu)
            mu_star = K_star @ alpha

            # σ*² = k(x*, x*) - K_*^T K^{-1} K_* + K_*^T K^{-1} Σ K^{-1} K_*
            v = torch.linalg.solve(self.K_chol, K_star.T)
            var_reduction = (v**2).sum(dim=0)

            # Add variance from q(f)
            beta = torch.linalg.solve(self.K, L_cov)
            var_addition = ((K_star @ beta)**2).sum(dim=1)

            sigma_star_sq = K_star_star - var_reduction + var_addition
            sigma_star = torch.sqrt(torch.clamp(sigma_star_sq, min=1e-6))

            # Monte Carlo for class probability
            epsilon = torch.randn(n_samples, len(X_test))
            f_samples = mu_star.unsqueeze(0) + epsilon * sigma_star.unsqueeze(0)
            pi_star = torch.sigmoid(f_samples).mean(dim=0)

        return pi_star.numpy(), mu_star.numpy(), sigma_star.numpy()
```

The code uses a 'whitened' parameterization: we optimize in a space where the prior is N(0, I), which improves optimization by making the geometry more uniform. The transformation is f = L_K @ f_white, where L_K is the Cholesky factor of K.
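A possible usage pattern for the class above, assuming a simple NumPy RBF kernel helper and the two-moons toy dataset (these names and choices are illustrative, not part of the original text):

```python
import numpy as np
import torch
from sklearn.datasets import make_moons

def rbf_kernel(X1, X2, lengthscale=1.0, variance=1.0):
    """Simple RBF kernel returning a NumPy array, as expected by VariationalGPC."""
    X1, X2 = np.asarray(X1, dtype=float), np.asarray(X2, dtype=float)
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * d2 / lengthscale**2)

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
model = VariationalGPC(
    n_train=len(y),
    kernel_fn=rbf_kernel,
    X_train=torch.tensor(X, dtype=torch.float32),
    y_train=torch.tensor(y, dtype=torch.float32),
).fit(n_iter=500, lr=0.1)

probs, f_mean, f_std = model.predict(X, rbf_kernel)
print(probs[:5])  # predicted P(y = 1 | x) for the first few points
```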
The full variational approach requires $O(n^2)$ parameters and $O(n^3)$ computation for the KL term. For large datasets, this is prohibitive. Sparse variational methods introduce $m \ll n$ inducing points that summarize the posterior.
The Inducing Point Framework:
Introduce $m$ inducing inputs $Z = \{\mathbf{z}_1, \ldots, \mathbf{z}_m\}$ and corresponding inducing variables $\mathbf{u} = [f(\mathbf{z}_1), \ldots, f(\mathbf{z}_m)]^\top$.
The key identity: for a GP, conditioned on $\mathbf{u}$, the training latents $\mathbf{f}$ are:
$$p(\mathbf{f} | \mathbf{u}) = \mathcal{N}(\mathbf{f} \,|\, K_{fu}K_{uu}^{-1}\mathbf{u},\; K_{ff} - K_{fu}K_{uu}^{-1}K_{uf})$$
where $K_{fu}$ is the kernel between training and inducing points.
The Variational Approximation:
Place a variational distribution only on the inducing variables:
$$q(\mathbf{u}) = \mathcal{N}(\mathbf{u} | \mathbf{m}, S)$$
The variational distribution on $\mathbf{f}$ is then:
$$q(\mathbf{f}) = \int p(\mathbf{f} | \mathbf{u}) q(\mathbf{u}) d\mathbf{u}$$
This integral is tractable (convolution of Gaussians):
$$q(\mathbf{f}) = \mathcal{N}(\mathbf{f} \,|\, A\mathbf{m},\; K_{ff} - A(K_{uu} - S)A^\top)$$
where $A = K_{fu}K_{uu}^{-1}$.
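This is the standard Gaussian marginalization identity: with $q(\mathbf{u}) = \mathcal{N}(\mathbf{m}, S)$ and $p(\mathbf{f} | \mathbf{u}) = \mathcal{N}(A\mathbf{u},\; K_{ff} - AK_{uu}A^\top)$,

$$\mathbb{E}[\mathbf{f}] = A\mathbf{m}, \qquad \text{Cov}[\mathbf{f}] = K_{ff} - AK_{uu}A^\top + ASA^\top = K_{ff} - A(K_{uu} - S)A^\top$$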
| Aspect | Full Variational | Sparse Variational |
|---|---|---|
| Variational parameters | O(n²) | O(m²) |
| KL computation | O(n³) | O(m³) |
| Expected log-likelihood | O(n) | O(nm²) |
| Total per iteration | O(n³) | O(nm² + m³) |
| Memory | O(n²) | O(nm + m²) |
The Sparse ELBO:
$$\mathcal{L}_{\text{sparse}} = \sum_{i=1}^n \mathbb{E}_{q(f_i)}[\log p(y_i | f_i)] - \text{KL}(q(\mathbf{u}) \,\|\, p(\mathbf{u}))$$
where $p(\mathbf{u}) = \mathcal{N}(\mathbf{u} | 0, K_{uu})$ is the inducing point prior.
The KL term is between $m$-dimensional Gaussians (tractable). Each expected log-likelihood term integrates over a univariate Gaussian marginal.
Inducing Point Placement:
The inducing points $Z$ can be fixed in advance (for example, a random subset of the training inputs or k-means centroids) or treated as free parameters and optimized alongside the variational parameters and kernel hyperparameters.
Learning the inducing locations often significantly improves performance, but increases optimization complexity. A typical k-means initialization is sketched below.
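A minimal initialization sketch using scikit-learn (the use of KMeans here is a common choice, not prescribed by the text):

```python
import numpy as np
from sklearn.cluster import KMeans

def init_inducing_points(X_train, n_inducing=50, seed=0):
    """Initialize inducing locations Z at k-means centroids of the training inputs."""
    km = KMeans(n_clusters=n_inducing, n_init=10, random_state=seed).fit(X_train)
    return km.cluster_centers_  # shape (n_inducing, d); typically refined by ELBO optimization
```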
Sparse variational GPs have an additional source of approximation error: the inducing point approximation itself. With m points, we can only represent m degrees of freedom in the posterior. The sparse ELBO lower-bounds the full ELBO, which lower-bounds the true log-evidence. Quality depends crucially on m and inducing point placement.
The sparse ELBO decomposes as a sum over data points, enabling stochastic optimization with mini-batches.
Mini-batch ELBO:
For a mini-batch $\mathcal{B} \subset \{1, \ldots, n\}$ of size $b$:
$$\mathcal{L}_{\text{batch}} = \frac{n}{b}\sum_{i \in \mathcal{B}} \mathbb{E}_{q(f_i)}[\log p(y_i | f_i)] - \text{KL}(q(\mathbf{u}) \,\|\, p(\mathbf{u}))$$
The $n/b$ factor scales the mini-batch expected log-likelihood to estimate the full sum.
Algorithm: Stochastic Variational GPC:
```python
import numpy as np
import torch
import torch.nn as nn


class SparseVariationalGPC(nn.Module):
    """
    Sparse Variational Gaussian Process Classification.

    Scales to large datasets via inducing points and mini-batching.
    """

    def __init__(self, X_train, y_train, n_inducing=50, kernel_params=None):
        super().__init__()
        self.n = len(y_train)
        self.X = torch.tensor(X_train, dtype=torch.float32)
        self.y = torch.tensor(y_train, dtype=torch.float32)

        # Kernel parameters (learnable)
        self.log_lengthscale = nn.Parameter(torch.tensor(0.0))
        self.log_variance = nn.Parameter(torch.tensor(0.0))

        # Inducing points (learnable locations)
        # Initialize with k-means or a random subset
        indices = np.random.choice(len(X_train), n_inducing, replace=False)
        self.Z = nn.Parameter(torch.tensor(X_train[indices], dtype=torch.float32))
        self.m = n_inducing

        # Variational parameters for q(u)
        self.q_mu = nn.Parameter(torch.zeros(n_inducing))
        self.q_L = nn.Parameter(torch.eye(n_inducing))  # Cholesky of q(u) covariance

    def kernel(self, X1, X2):
        """RBF kernel with learnable parameters."""
        lengthscale = torch.exp(self.log_lengthscale)
        variance = torch.exp(self.log_variance)
        # Squared Euclidean distance
        dist_sq = torch.cdist(X1 / lengthscale, X2 / lengthscale, p=2)**2
        return variance * torch.exp(-0.5 * dist_sq)

    def elbo(self, batch_indices=None, n_samples=5):
        """Compute the sparse variational ELBO."""
        if batch_indices is None:
            batch_indices = torch.arange(self.n)
        batch_size = len(batch_indices)
        X_batch = self.X[batch_indices]
        y_batch = self.y[batch_indices]

        # Compute kernel matrices
        K_uu = self.kernel(self.Z, self.Z) + 1e-6 * torch.eye(self.m)
        K_bu = self.kernel(X_batch, self.Z)

        # Cholesky of K_uu
        L_uu = torch.linalg.cholesky(K_uu)

        # q(u) covariance: S = L_q @ L_q^T
        L_q = torch.tril(self.q_L)
        S = L_q @ L_q.T

        # KL divergence: KL(q(u) || p(u))
        # KL = 0.5 * (tr(K_uu^{-1} S) + μ^T K_uu^{-1} μ - m + log|K_uu| - log|S|)
        alpha = torch.linalg.solve(L_uu, self.q_mu)
        beta = torch.linalg.solve(L_uu, L_q)
        kl = 0.5 * (
            (beta**2).sum() +                               # tr(K_uu^{-1} S)
            (alpha**2).sum() -                              # μ^T K_uu^{-1} μ
            self.m +
            2 * torch.log(torch.diag(L_uu)).sum() -         # log|K_uu|
            2 * torch.log(torch.diag(L_q).abs()).sum()      # log|S|
        )

        # Compute q(f) for the batch: mean and variance
        # A = K_bu @ K_uu^{-1}
        A = torch.linalg.solve(L_uu.T, torch.linalg.solve(L_uu, K_bu.T)).T

        # q(f) mean: A @ q_mu
        f_mean = A @ self.q_mu

        # q(f) variance: k(x,x) - A K_uu A^T + A S A^T (diagonal only)
        K_bb_diag = torch.exp(self.log_variance) * torch.ones(batch_size)
        v = torch.linalg.solve(L_uu, K_bu.T)
        var_reduction = (v**2).sum(dim=0)
        w = A @ L_q
        var_addition = (w**2).sum(dim=1)
        f_var = K_bb_diag - var_reduction + var_addition
        f_std = torch.sqrt(torch.clamp(f_var, min=1e-6))

        # Expected log-likelihood via reparameterization
        epsilon = torch.randn(n_samples, batch_size)
        f_samples = f_mean.unsqueeze(0) + epsilon * f_std.unsqueeze(0)

        # Logistic log-likelihood
        y_exp = y_batch.unsqueeze(0).expand(n_samples, -1)
        log_lik = (y_exp * torch.log(torch.sigmoid(f_samples) + 1e-8) +
                   (1 - y_exp) * torch.log(1 - torch.sigmoid(f_samples) + 1e-8))

        # Scale by n/batch_size to estimate the full sum
        expected_ll = log_lik.sum(dim=1).mean() * (self.n / batch_size)

        elbo = expected_ll - kl
        return elbo

    def fit(self, n_iter=1000, batch_size=100, lr=0.01):
        """Stochastic optimization of the ELBO."""
        optimizer = torch.optim.Adam(self.parameters(), lr=lr)

        for i in range(n_iter):
            optimizer.zero_grad()
            # Sample a mini-batch
            batch_indices = torch.randperm(self.n)[:batch_size]
            loss = -self.elbo(batch_indices, n_samples=5)
            loss.backward()
            optimizer.step()

            if (i + 1) % 200 == 0:
                with torch.no_grad():
                    full_elbo = self.elbo(n_samples=10)
                print(f"Iter {i+1}: ELBO ≈ {full_elbo.item():.4f}")

        return self
```

Comparison of Approximate Inference Methods:

| Aspect | Laplace | EP | Variational |
|---|---|---|---|
| Approximation type | Mode + Hessian | Moment matching | ELBO optimization |
| KL direction | N/A (Taylor) | Inclusive (p||q) | Exclusive (q||p) |
| Convergence | Guaranteed | Not guaranteed | To local optimum |
| Variance estimate | Often too small | Good | Controlled by family |
| Scalability | O(n³) | O(n³) | O(nm²) with inducing |
| Stochastic training | No | No | Yes (mini-batches) |
| Implementation | Simple | Complex | Moderate with autodiff |
| Library support | GPy, sklearn | GPy, GPflow | GPflow, GPyTorch |
When to Use Each Method:
| Scenario | Recommended Method | Reason |
|---|---|---|
| Small dataset (n < 1000) | EP | Best accuracy |
| Medium dataset (1K-10K) | Laplace or VI | Good trade-off |
| Large dataset (n > 10K) | Sparse VI | Only scalable option |
| Need hyperparameter optimization | VI | Clean gradient flow |
| Need uncertainty calibration | EP or VI | Laplace underestimates |
| Rapid prototyping | Laplace | Simplest to implement |
In modern GP software (GPflow, GPyTorch), sparse variational inference is the default for classification. It seamlessly handles large datasets, provides a unified framework for hyperparameter learning, and integrates naturally with deep learning frameworks for hybrid GP-neural network models.
Variational GP classification has deep connections to modern deep learning, particularly through the Variational Autoencoder (VAE) framework.
Shared Structure:
Both VAE and variational GPC optimize an ELBO:
$$\mathcal{L} = \mathbb{E}_{q}[\log p(y | z)] - \text{KL}(q(z) || p(z))$$
| Component | VAE | Variational GPC |
|---|---|---|
| Latent variable | $z$ (learned representation) | $f$ (function values) |
| Likelihood | Bernoulli/Gaussian | Bernoulli (sigmoid) |
| Prior | $\mathcal{N}(0, I)$ | $\mathcal{N}(0, K)$ (GP) |
| Variational family | $\mathcal{N}(\mu(x), \sigma(x)^2)$ | $\mathcal{N}(\mu, \Sigma)$ |
| Correlation structure | Optional | From kernel |
Key Difference:
In VAEs, the prior is fixed ($\mathcal{N}(0, I)$) and the goal is to learn a latent representation. In GPC, the prior (through the kernel) encodes rich structural assumptions about the function.
Deep Kernel Learning:
A powerful hybrid combines neural networks with GPs:
$$k(x, x') = k_\text{base}(g_\theta(x), g_\theta(x'))$$
where $g_\theta$ is a neural network that learns feature representations, and $k_\text{base}$ is a standard kernel (e.g., RBF) in the learned feature space.
Variational inference enables end-to-end training: the neural network weights $\theta$, the kernel hyperparameters, the inducing locations, and the variational parameters are all optimized jointly by maximizing the ELBO, as sketched below.
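A minimal sketch of a deep kernel in PyTorch, assuming a small MLP feature extractor $g_\theta$ and an RBF base kernel (the module and parameter names are illustrative, not a library API):

```python
import torch
import torch.nn as nn

class DeepRBFKernel(nn.Module):
    """k(x, x') = k_RBF(g_theta(x), g_theta(x')) with a learnable feature map g_theta."""

    def __init__(self, input_dim, feature_dim=8):
        super().__init__()
        self.g = nn.Sequential(                  # feature extractor g_theta
            nn.Linear(input_dim, 32), nn.Tanh(),
            nn.Linear(32, feature_dim),
        )
        self.log_lengthscale = nn.Parameter(torch.tensor(0.0))
        self.log_variance = nn.Parameter(torch.tensor(0.0))

    def forward(self, X1, X2):
        h1, h2 = self.g(X1), self.g(X2)          # map inputs into the learned feature space
        ell = torch.exp(self.log_lengthscale)
        d2 = torch.cdist(h1 / ell, h2 / ell) ** 2
        return torch.exp(self.log_variance) * torch.exp(-0.5 * d2)
```

Because the feature extractor, kernel hyperparameters, inducing locations, and variational parameters would all be registered as module parameters, a single optimizer step can update them together through the ELBO.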
Automatic Differentiation:
Modern GP libraries leverage PyTorch/TensorFlow automatic differentiation: ELBO gradients with respect to variational parameters, kernel hyperparameters, inducing locations, and any neural network weights are computed automatically, so no hand-derived update equations are needed.
This unifies GP inference with the broader deep learning ecosystem.
Neural networks excel at learning features from raw data. GPs excel at principled uncertainty quantification. Combining them via deep kernels gives the best of both worlds: powerful feature learning with calibrated uncertainty—crucial for safety-critical applications like medical diagnosis or autonomous systems.
Variational inference provides a flexible, scalable framework for GP classification that integrates seamlessly with modern deep learning tools.
What's Next:
The final page of this module extends GP classification to multi-class problems. We'll see how the binary classification framework generalizes, explore the softmax likelihood, and understand the computational challenges of multi-output GPs.
You now understand variational inference for GP classification—from the ELBO derivation, through sparse approximations with inducing points, to stochastic optimization for large-scale problems. This framework powers modern GP libraries and enables integration with deep learning.