Both Laplace approximation and Expectation Propagation find Gaussian approximations to the classification posterior, but through fundamentally different mechanisms—mode-finding and moment matching, respectively. Variational Inference (VI) offers a third paradigm: casting approximate inference as an optimization problem.
The variational approach seeks the distribution $q$ from a tractable family that minimizes a divergence from the true posterior. This optimization perspective has profound practical advantages: it naturally interfaces with modern optimization tools (stochastic gradient descent, automatic differentiation) and provides a principled framework for sparse approximations that scale GP classification to large datasets.
In this page, we develop variational GP classification from first principles, arriving at methods that power modern GP libraries like GPflow and GPyTorch.
By the end of this page, you will understand: (1) The variational inference framework and the Evidence Lower Bound (ELBO), (2) How to derive the ELBO for GP classification, (3) The reparameterization trick for gradient estimation, (4) Sparse variational GP classification with inducing points, (5) Stochastic optimization for large-scale problems, and (6) Connections to variational autoencoders and modern deep learning.
Variational inference reformulates posterior approximation as optimization. We seek a distribution $q(\mathbf{f})$ from a tractable family $\mathcal{Q}$ that is 'closest' to the true posterior.
The KL Divergence Objective:
We want to minimize:
$$\text{KL}(q(\mathbf{f}) \,\|\, p(\mathbf{f} | \mathbf{y}, X)) = \int q(\mathbf{f}) \log \frac{q(\mathbf{f})}{p(\mathbf{f} | \mathbf{y}, X)} \, d\mathbf{f}$$
But this requires the intractable posterior! The variational trick is to rearrange this into a computable objective.
Deriving the ELBO:
Using $p(\mathbf{f} | \mathbf{y}, X) = p(\mathbf{y} | \mathbf{f}) p(\mathbf{f} | X) / p(\mathbf{y} | X)$:
$$\text{KL}(q \,\|\, p) = \log p(\mathbf{y} | X) - \left[\mathbb{E}_q[\log p(\mathbf{y} | \mathbf{f})] - \text{KL}(q(\mathbf{f}) \,\|\, p(\mathbf{f} | X))\right]$$
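In more detail, expanding the log of Bayes' rule inside the KL and noting that $\log p(\mathbf{y} | X)$ does not depend on $\mathbf{f}$:

$$\text{KL}(q \,\|\, p) = \mathbb{E}_q[\log q(\mathbf{f})] - \mathbb{E}_q[\log p(\mathbf{y} | \mathbf{f})] - \mathbb{E}_q[\log p(\mathbf{f} | X)] + \log p(\mathbf{y} | X)$$

Grouping $\mathbb{E}_q[\log q(\mathbf{f})] - \mathbb{E}_q[\log p(\mathbf{f} | X)] = \text{KL}(q(\mathbf{f}) \,\|\, p(\mathbf{f} | X))$ gives the identity above.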
Since KL ≥ 0:
$$\log p(\mathbf{y} | X) \geq \underbrace{\mathbb{E}_q[\log p(\mathbf{y} | \mathbf{f})] - \text{KL}(q(\mathbf{f}) \,\|\, p(\mathbf{f} | X))}_{\text{ELBO}}$$
The ELBO (Evidence Lower BOund) lower-bounds the log marginal likelihood (evidence). Maximizing the ELBO simultaneously: (1) Finds q close to the posterior (minimizes KL), and (2) Provides an approximation to the marginal likelihood for hyperparameter learning. The gap between ELBO and log-evidence equals the KL divergence.
The ELBO Decomposition:
$$\mathcal{L}_{\text{ELBO}} = \underbrace{\mathbb{E}_q[\log p(\mathbf{y} | \mathbf{f})]}_{\text{Expected log-likelihood}} - \underbrace{\text{KL}(q(\mathbf{f}) \,\|\, p(\mathbf{f} | X))}_{\text{Complexity penalty}}$$
This decomposition has an intuitive interpretation:
Expected log-likelihood: Encourages $q$ to place mass where the data is likely. This term measures how well the approximate posterior explains the observations.
KL regularizer: Penalizes $q$ for deviating from the prior. This prevents overfitting and encodes our prior beliefs about function smoothness.
Maximizing the ELBO balances data fit against prior fidelity—exactly the Bayesian trade-off.
The choice of variational family $\mathcal{Q}$ determines both the quality of approximation and computational tractability. For GP classification, we use Gaussian variational distributions.
Full Gaussian Variational Posterior:
$$q(\mathbf{f}) = \mathcal{N}(\mathbf{f} | \boldsymbol{\mu}, \Sigma)$$
where $\boldsymbol{\mu} \in \mathbb{R}^n$ and $\Sigma \in \mathbb{R}^{n \times n}$ (symmetric positive definite) are variational parameters.
Parameter Count:
A full Gaussian requires $n$ mean parameters plus $n(n+1)/2$ free covariance parameters, i.e., $O(n^2)$ in total. This is prohibitive for large $n$. Common restrictions include a diagonal (mean-field) covariance, a low-rank-plus-diagonal covariance, or a structured Cholesky factor $\Sigma = LL^\top$.
Computing the KL Term:
For the GP prior $p(\mathbf{f} | X) = \mathcal{N}(\mathbf{f} | \mathbf{m}, K)$ and variational posterior $q(\mathbf{f}) = \mathcal{N}(\mathbf{f} | \boldsymbol{\mu}, \Sigma)$:
$$\text{KL}(q \,\|\, p) = \frac{1}{2}\left[\text{tr}(K^{-1}\Sigma) + (\boldsymbol{\mu} - \mathbf{m})^\top K^{-1}(\boldsymbol{\mu} - \mathbf{m}) - n + \log|K| - \log|\Sigma|\right]$$
This is the KL divergence between two Gaussians and has the closed form above.
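As a concrete reference, this expression can be evaluated stably via Cholesky factors. A minimal PyTorch sketch (the function name and use of `torch.cholesky_solve` are illustrative choices, not prescribed by the text):

```python
import torch

def gaussian_kl(mu, Sigma, m, K):
    """KL( N(mu, Sigma) || N(m, K) ) for dense covariance matrices."""
    n = mu.shape[0]
    L_K = torch.linalg.cholesky(K)
    L_S = torch.linalg.cholesky(Sigma)
    # tr(K^{-1} Sigma) via a Cholesky solve
    trace_term = torch.cholesky_solve(Sigma, L_K).diagonal().sum()
    # (mu - m)^T K^{-1} (mu - m)
    diff = (mu - m).unsqueeze(-1)
    quad_term = (diff * torch.cholesky_solve(diff, L_K)).sum()
    # log|K| - log|Sigma| from the Cholesky diagonals
    logdet_term = 2 * (torch.log(L_K.diagonal()).sum() - torch.log(L_S.diagonal()).sum())
    return 0.5 * (trace_term + quad_term - n + logdet_term)
```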
Computing the Expected Log-Likelihood:
For classification with independent likelihoods:
$$\mathbb{E}_q[\log p(\mathbf{y} | \mathbf{f})] = \sum_{i=1}^n \mathbb{E}_{q(f_i)}[\log p(y_i | f_i)]$$
Each term is a one-dimensional integral:
$$\mathbb{E}_{q(f_i)}[\log p(y_i | f_i)] = \int \log p(y_i | f_i) \cdot \mathcal{N}(f_i | \mu_i, \sigma_i^2) \, df_i$$
For logistic likelihood, this integral is intractable but can be approximated via Gauss-Hermite quadrature or Monte Carlo sampling.
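For instance, a minimal Gauss-Hermite sketch for the logistic case (the function name and 20-point rule are illustrative; `scipy.special.log_expit` is assumed available as a stable log-sigmoid):

```python
import numpy as np
from scipy.special import log_expit  # numerically stable log(sigmoid(x))

def expected_log_lik_gh(y, mu, sigma, n_points=20):
    """Approximate E_{N(f | mu, sigma^2)}[ log p(y | f) ] for a logistic likelihood, y in {0, 1}."""
    # Nodes/weights for integrals of the form  int exp(-x^2) g(x) dx
    x, w = np.polynomial.hermite.hermgauss(n_points)
    f = mu + np.sqrt(2.0) * sigma * x                          # change of variables
    log_lik = np.where(y == 1, log_expit(f), log_expit(-f))    # log p(y | f)
    return np.sum(w * log_lik) / np.sqrt(np.pi)
```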
The 'mean-field' approximation uses diagonal Σ, assuming posterior independence between latent values. This is computationally cheap but ignores correlations. For GP classification, correlations matter! Full or low-rank covariance is often worth the extra cost, or use inducing point methods that naturally capture correlations.
Maximizing the ELBO with respect to variational parameters requires computing gradients. The KL term has a closed-form gradient. The expected log-likelihood is the challenge.
The Monte Carlo Gradient Estimator:
A naive approach: sample $f^{(s)} \sim q(\mathbf{f})$ and estimate:
$$\nabla_\phi \mathbb{E}_q[\log p(\mathbf{y} | \mathbf{f})] \approx \frac{1}{S}\sum_{s=1}^S \log p(\mathbf{y} | f^{(s)}) \nabla_\phi \log q(f^{(s)})$$
This is the score function estimator (REINFORCE). It's unbiased but has high variance.
The Reparameterization Trick:
A lower-variance alternative: express samples as a deterministic transformation of a parameter-free random variable.
For $q(\mathbf{f}) = \mathcal{N}(\boldsymbol{\mu}, \Sigma)$ with $\Sigma = LL^\top$ (Cholesky):
$$\mathbf{f} = \boldsymbol{\mu} + L\boldsymbol{\epsilon}, \quad \boldsymbol{\epsilon} \sim \mathcal{N}(0, I)$$
Now the gradient:
$$\nabla_\phi \mathbb{E}_q[\log p(\mathbf{y} | \mathbf{f})] = \mathbb{E}_{\boldsymbol{\epsilon}}\left[\nabla_\phi \log p(\mathbf{y} | \boldsymbol{\mu} + L\boldsymbol{\epsilon})\right]$$
where the gradient can be computed via automatic differentiation through the log-likelihood.
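A minimal one-dimensional illustration of the trick (a toy setup for exposition only, not part of the original text):

```python
import torch

mu = torch.tensor(0.5, requires_grad=True)           # variational mean
log_sigma = torch.tensor(-1.0, requires_grad=True)   # log of variational std

eps = torch.randn(1000)                               # parameter-free noise
f = mu + torch.exp(log_sigma) * eps                   # reparameterized samples f ~ N(mu, sigma^2)

# Monte Carlo estimate of E_q[log sigmoid(f)], i.e. the expected log-likelihood for y = 1
obj = torch.nn.functional.logsigmoid(f).mean()
obj.backward()                                        # gradients flow through mu and log_sigma
print(mu.grad, log_sigma.grad)
```

The full implementation below puts the whitened parameterization, the closed-form KL, and the reparameterized expected log-likelihood together.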
```python
import numpy as np
import torch
import torch.nn as nn
from torch.distributions import Normal, kl_divergence


class VariationalGPC(nn.Module):
    """
    Variational Gaussian Process Classification.

    Uses the reparameterization trick for gradient estimation.
    """

    def __init__(self, n_train, kernel_fn, X_train, y_train):
        super().__init__()
        self.n = n_train
        self.X = X_train
        self.y = y_train

        # Precompute kernel matrix
        self.K = torch.tensor(kernel_fn(X_train, X_train), dtype=torch.float32)
        self.K_chol = torch.linalg.cholesky(self.K + 1e-6 * torch.eye(self.n))

        # Variational parameters (using whitened parameterization)
        # q(f) = N(μ, Σ) where μ = L_K @ m_white, Σ = L_K @ S_white @ L_K^T
        self.m_white = nn.Parameter(torch.zeros(n_train))
        self.L_white_diag = nn.Parameter(torch.zeros(n_train))  # log of diagonal

    def get_variational_distribution(self):
        """Compute variational mean and covariance factor."""
        # Diagonal covariance in whitened space
        L_white = torch.diag(torch.exp(self.L_white_diag))

        # Transform to original space
        mu = self.K_chol @ self.m_white
        # Σ = L_K @ S_white @ L_K^T where S_white = L_white @ L_white^T
        L_cov = self.K_chol @ L_white

        return mu, L_cov

    def elbo(self, n_samples=10):
        """
        Compute the ELBO using the reparameterization trick.

        ELBO = E_q[log p(y|f)] - KL(q(f) || p(f))
        """
        mu, L_cov = self.get_variational_distribution()

        # KL divergence: KL(N(μ, Σ) || N(0, K))
        # In the whitened parameterization, this simplifies to:
        # KL = 0.5 * (tr(S_white) + m_white^T m_white - n - log|S_white|)
        S_white_diag = torch.exp(2 * self.L_white_diag)
        kl = 0.5 * (S_white_diag.sum() + (self.m_white**2).sum()
                    - self.n - 2 * self.L_white_diag.sum())

        # Expected log-likelihood via Monte Carlo
        epsilon = torch.randn(n_samples, self.n)
        f_samples = mu.unsqueeze(0) + epsilon @ L_cov.T  # (n_samples, n)

        # Logistic log-likelihood
        y_expanded = self.y.unsqueeze(0).expand(n_samples, -1)
        log_likelihood = (y_expanded * torch.log(torch.sigmoid(f_samples)) +
                          (1 - y_expanded) * torch.log(1 - torch.sigmoid(f_samples)))

        expected_ll = log_likelihood.sum(dim=1).mean()  # Average over samples

        elbo = expected_ll - kl
        return elbo

    def fit(self, n_iter=500, lr=0.1):
        """Optimize the ELBO."""
        optimizer = torch.optim.Adam(self.parameters(), lr=lr)

        for i in range(n_iter):
            optimizer.zero_grad()
            loss = -self.elbo(n_samples=10)  # Negative ELBO for minimization
            loss.backward()
            optimizer.step()

            if (i + 1) % 100 == 0:
                print(f"Iter {i+1}: ELBO = {-loss.item():.4f}")

        return self

    def predict(self, X_test, kernel_fn, n_samples=100):
        """Predict class probabilities at test points."""
        with torch.no_grad():
            mu, L_cov = self.get_variational_distribution()

            # Test kernel vectors
            K_star = torch.tensor(kernel_fn(X_test, self.X.numpy()), dtype=torch.float32)
            K_star_star = torch.tensor(
                np.diag(kernel_fn(X_test, X_test)), dtype=torch.float32
            )

            # Predictive distribution
            # μ* = K_*^T K^{-1} μ
            alpha = torch.linalg.solve(self.K, mu)
            mu_star = K_star @ alpha

            # σ*² = k(x*, x*) - K_*^T K^{-1} K_* + K_*^T K^{-1} Σ K^{-1} K_*
            v = torch.linalg.solve(self.K_chol, K_star.T)
            var_reduction = (v**2).sum(dim=0)

            # Add variance from q(f)
            beta = torch.linalg.solve(self.K, L_cov)
            var_addition = ((K_star @ beta)**2).sum(dim=1)

            sigma_star_sq = K_star_star - var_reduction + var_addition
            sigma_star = torch.sqrt(torch.clamp(sigma_star_sq, min=1e-6))

            # Monte Carlo for class probability
            epsilon = torch.randn(n_samples, len(X_test))
            f_samples = mu_star.unsqueeze(0) + epsilon * sigma_star.unsqueeze(0)
            pi_star = torch.sigmoid(f_samples).mean(dim=0)

        return pi_star.numpy(), mu_star.numpy(), sigma_star.numpy()
```

The code uses a 'whitened' parameterization: we optimize in a space where the prior is N(0, I), which improves optimization by making the geometry more uniform. The transformation is f = L_K @ f_white, where L_K is the Cholesky factor of K.
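A possible usage pattern for the class above, assuming a simple NumPy RBF kernel helper and the two-moons toy dataset (these names and choices are illustrative, not part of the original text):

```python
import numpy as np
import torch
from sklearn.datasets import make_moons

def rbf_kernel(X1, X2, lengthscale=1.0, variance=1.0):
    """Simple RBF kernel returning a NumPy array, as expected by VariationalGPC."""
    X1, X2 = np.asarray(X1, dtype=float), np.asarray(X2, dtype=float)
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * d2 / lengthscale**2)

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
model = VariationalGPC(
    n_train=len(y),
    kernel_fn=rbf_kernel,
    X_train=torch.tensor(X, dtype=torch.float32),
    y_train=torch.tensor(y, dtype=torch.float32),
).fit(n_iter=500, lr=0.1)

probs, f_mean, f_std = model.predict(X, rbf_kernel)
print(probs[:5])  # predicted P(y = 1 | x) for the first few points
```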
The full variational approach requires $O(n^2)$ parameters and $O(n^3)$ computation for the KL term. For large datasets, this is prohibitive. Sparse variational methods introduce $m \ll n$ inducing points that summarize the posterior.
The Inducing Point Framework:
Introduce $m$ inducing inputs $Z = \{\mathbf{z}_1, \ldots, \mathbf{z}_m\}$ and corresponding inducing variables $\mathbf{u} = [f(\mathbf{z}_1), \ldots, f(\mathbf{z}_m)]^\top$.
The key identity: for a GP, conditioned on $\mathbf{u}$, the training latents $\mathbf{f}$ are:
$$p(\mathbf{f} | \mathbf{u}) = \mathcal{N}(\mathbf{f} \,|\, K_{fu}K_{uu}^{-1}\mathbf{u},\; K_{ff} - K_{fu}K_{uu}^{-1}K_{uf})$$
where $K_{fu}$ is the kernel between training and inducing points.
The Variational Approximation:
Place a variational distribution only on the inducing variables:
$$q(\mathbf{u}) = \mathcal{N}(\mathbf{u} | \mathbf{m}, S)$$
The variational distribution on $\mathbf{f}$ is then:
$$q(\mathbf{f}) = \int p(\mathbf{f} | \mathbf{u}) q(\mathbf{u}) d\mathbf{u}$$
This integral is tractable (convolution of Gaussians):
$$q(\mathbf{f}) = \mathcal{N}(\mathbf{f} \,|\, A\mathbf{m},\; K_{ff} - A(K_{uu} - S)A^\top)$$
where $A = K_{fu}K_{uu}^{-1}$.
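This is the standard Gaussian marginalization identity: with $q(\mathbf{u}) = \mathcal{N}(\mathbf{m}, S)$ and $p(\mathbf{f} | \mathbf{u}) = \mathcal{N}(A\mathbf{u},\; K_{ff} - AK_{uu}A^\top)$,

$$\mathbb{E}[\mathbf{f}] = A\mathbf{m}, \qquad \text{Cov}[\mathbf{f}] = K_{ff} - AK_{uu}A^\top + ASA^\top = K_{ff} - A(K_{uu} - S)A^\top$$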
| Aspect | Full Variational | Sparse Variational |
|---|---|---|
| Variational parameters | O(n²) | O(m²) |
| KL computation | O(n³) | O(m³) |
| Expected log-likelihood | O(n) | O(nm²) |
| Total per iteration | O(n³) | O(nm² + m³) |
| Memory | O(n²) | O(nm + m²) |
The Sparse ELBO:
$$\mathcal{L}_{\text{sparse}} = \sum_{i=1}^n \mathbb{E}_{q(f_i)}[\log p(y_i | f_i)] - \text{KL}(q(\mathbf{u}) \,\|\, p(\mathbf{u}))$$
where $p(\mathbf{u}) = \mathcal{N}(\mathbf{u} | 0, K_{uu})$ is the inducing point prior.
The KL term is between $m$-dimensional Gaussians (tractable). Each expected log-likelihood term integrates over a univariate Gaussian marginal.
Inducing Point Placement:
The inducing points $Z$ can be fixed in advance (for example, a random subset of the training inputs or k-means centroids) or treated as free parameters and optimized alongside the variational parameters and kernel hyperparameters.
Learning the inducing locations often significantly improves performance, but increases optimization complexity. A typical k-means initialization is sketched below.
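A minimal initialization sketch using scikit-learn (the use of KMeans here is a common choice, not prescribed by the text):

```python
import numpy as np
from sklearn.cluster import KMeans

def init_inducing_points(X_train, n_inducing=50, seed=0):
    """Initialize inducing locations Z at k-means centroids of the training inputs."""
    km = KMeans(n_clusters=n_inducing, n_init=10, random_state=seed).fit(X_train)
    return km.cluster_centers_  # shape (n_inducing, d); typically refined by ELBO optimization
```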
Sparse variational GPs have an additional source of approximation error: the inducing point approximation itself. With m points, we can only represent m degrees of freedom in the posterior. The sparse ELBO lower-bounds the full ELBO, which lower-bounds the true log-evidence. Quality depends crucially on m and inducing point placement.
The sparse ELBO decomposes as a sum over data points, enabling stochastic optimization with mini-batches.
Mini-batch ELBO:
For a mini-batch $\mathcal{B} \subset \{1, \ldots, n\}$ of size $b$:
$$\mathcal{L}_{\text{batch}} = \frac{n}{b}\sum_{i \in \mathcal{B}} \mathbb{E}_{q(f_i)}[\log p(y_i | f_i)] - \text{KL}(q(\mathbf{u}) \,\|\, p(\mathbf{u}))$$
The $n/b$ factor scales the mini-batch expected log-likelihood to estimate the full sum.
Algorithm: Stochastic Variational GPC:
```python
import numpy as np
import torch
import torch.nn as nn


class SparseVariationalGPC(nn.Module):
    """
    Sparse Variational Gaussian Process Classification.

    Scales to large datasets via inducing points and mini-batching.
    """

    def __init__(self, X_train, y_train, n_inducing=50, kernel_params=None):
        super().__init__()
        self.n = len(y_train)
        self.X = torch.tensor(X_train, dtype=torch.float32)
        self.y = torch.tensor(y_train, dtype=torch.float32)

        # Kernel parameters (learnable)
        self.log_lengthscale = nn.Parameter(torch.tensor(0.0))
        self.log_variance = nn.Parameter(torch.tensor(0.0))

        # Inducing points (learnable locations)
        # Initialize with k-means or a random subset
        indices = np.random.choice(len(X_train), n_inducing, replace=False)
        self.Z = nn.Parameter(torch.tensor(X_train[indices], dtype=torch.float32))
        self.m = n_inducing

        # Variational parameters for q(u)
        self.q_mu = nn.Parameter(torch.zeros(n_inducing))
        self.q_L = nn.Parameter(torch.eye(n_inducing))  # Cholesky of q(u) covariance

    def kernel(self, X1, X2):
        """RBF kernel with learnable parameters."""
        lengthscale = torch.exp(self.log_lengthscale)
        variance = torch.exp(self.log_variance)
        # Squared Euclidean distance
        dist_sq = torch.cdist(X1 / lengthscale, X2 / lengthscale, p=2)**2
        return variance * torch.exp(-0.5 * dist_sq)

    def elbo(self, batch_indices=None, n_samples=5):
        """Compute the sparse variational ELBO."""
        if batch_indices is None:
            batch_indices = torch.arange(self.n)
        batch_size = len(batch_indices)
        X_batch = self.X[batch_indices]
        y_batch = self.y[batch_indices]

        # Compute kernel matrices
        K_uu = self.kernel(self.Z, self.Z) + 1e-6 * torch.eye(self.m)
        K_bu = self.kernel(X_batch, self.Z)

        # Cholesky of K_uu
        L_uu = torch.linalg.cholesky(K_uu)

        # q(u) covariance: S = L_q @ L_q^T
        L_q = torch.tril(self.q_L)
        S = L_q @ L_q.T

        # KL divergence: KL(q(u) || p(u))
        # KL = 0.5 * (tr(K_uu^{-1} S) + μ^T K_uu^{-1} μ - m + log|K_uu| - log|S|)
        alpha = torch.linalg.solve(L_uu, self.q_mu)
        beta = torch.linalg.solve(L_uu, L_q)
        kl = 0.5 * (
            (beta**2).sum() +                               # tr(K_uu^{-1} S)
            (alpha**2).sum() -                              # μ^T K_uu^{-1} μ
            self.m +
            2 * torch.log(torch.diag(L_uu)).sum() -         # log|K_uu|
            2 * torch.log(torch.diag(L_q).abs()).sum()      # log|S|
        )

        # Compute q(f) for the batch: mean and variance
        # A = K_bu @ K_uu^{-1}
        A = torch.linalg.solve(L_uu.T, torch.linalg.solve(L_uu, K_bu.T)).T

        # q(f) mean: A @ q_mu
        f_mean = A @ self.q_mu

        # q(f) variance: k(x,x) - A K_uu A^T + A S A^T (diagonal only)
        K_bb_diag = torch.exp(self.log_variance) * torch.ones(batch_size)
        v = torch.linalg.solve(L_uu, K_bu.T)
        var_reduction = (v**2).sum(dim=0)
        w = A @ L_q
        var_addition = (w**2).sum(dim=1)
        f_var = K_bb_diag - var_reduction + var_addition
        f_std = torch.sqrt(torch.clamp(f_var, min=1e-6))

        # Expected log-likelihood via reparameterization
        epsilon = torch.randn(n_samples, batch_size)
        f_samples = f_mean.unsqueeze(0) + epsilon * f_std.unsqueeze(0)

        # Logistic log-likelihood
        y_exp = y_batch.unsqueeze(0).expand(n_samples, -1)
        log_lik = (y_exp * torch.log(torch.sigmoid(f_samples) + 1e-8) +
                   (1 - y_exp) * torch.log(1 - torch.sigmoid(f_samples) + 1e-8))

        # Scale by n/batch_size to estimate the full sum
        expected_ll = log_lik.sum(dim=1).mean() * (self.n / batch_size)

        elbo = expected_ll - kl
        return elbo

    def fit(self, n_iter=1000, batch_size=100, lr=0.01):
        """Stochastic optimization of the ELBO."""
        optimizer = torch.optim.Adam(self.parameters(), lr=lr)

        for i in range(n_iter):
            optimizer.zero_grad()
            # Sample a mini-batch
            batch_indices = torch.randperm(self.n)[:batch_size]
            loss = -self.elbo(batch_indices, n_samples=5)
            loss.backward()
            optimizer.step()

            if (i + 1) % 200 == 0:
                with torch.no_grad():
                    full_elbo = self.elbo(n_samples=10)
                print(f"Iter {i+1}: ELBO ≈ {full_elbo.item():.4f}")

        return self
```

Comparison of Approximate Inference Methods:

| Aspect | Laplace | EP | Variational |
|---|---|---|---|
| Approximation type | Mode + Hessian | Moment matching | ELBO optimization |
| KL direction | N/A (Taylor) | Inclusive (p||q) | Exclusive (q||p) |
| Convergence | Guaranteed | Not guaranteed | To local optimum |
| Variance estimate | Often too small | Good | Controlled by family |
| Scalability | O(n³) | O(n³) | O(nm²) with inducing |
| Stochastic training | No | No | Yes (mini-batches) |
| Implementation | Simple | Complex | Moderate with autodiff |
| Library support | GPy, sklearn | GPy, GPflow | GPflow, GPyTorch |
When to Use Each Method:
| Scenario | Recommended Method | Reason |
|---|---|---|
| Small dataset (n < 1000) | EP | Best accuracy |
| Medium dataset (1K-10K) | Laplace or VI | Good trade-off |
| Large dataset (n > 10K) | Sparse VI | Only scalable option |
| Need hyperparameter optimization | VI | Clean gradient flow |
| Need uncertainty calibration | EP or VI | Laplace underestimates |
| Rapid prototyping | Laplace | Simplest to implement |
In modern GP software (GPflow, GPyTorch), sparse variational inference is the default for classification. It seamlessly handles large datasets, provides a unified framework for hyperparameter learning, and integrates naturally with deep learning frameworks for hybrid GP-neural network models.
Variational GP classification has deep connections to modern deep learning, particularly through the Variational Autoencoder (VAE) framework.
Shared Structure:
Both VAE and variational GPC optimize an ELBO:
$$\mathcal{L} = \mathbb{E}_{q}[\log p(y | z)] - \text{KL}(q(z) || p(z))$$
| Component | VAE | Variational GPC |
|---|---|---|
| Latent variable | $z$ (learned representation) | $f$ (function values) |
| Likelihood | Bernoulli/Gaussian | Bernoulli (sigmoid) |
| Prior | $\mathcal{N}(0, I)$ | $\mathcal{N}(0, K)$ (GP) |
| Variational family | $\mathcal{N}(\mu(x), \sigma(x)^2)$ | $\mathcal{N}(\mu, \Sigma)$ |
| Correlation structure | Optional | From kernel |
Key Difference:
In VAEs, the prior is fixed ($\mathcal{N}(0, I)$) and the goal is to learn a latent representation. In GPC, the prior (through the kernel) encodes rich structural assumptions about the function.
Deep Kernel Learning:
A powerful hybrid combines neural networks with GPs:
$$k(x, x') = k_\text{base}(g_\theta(x), g_\theta(x'))$$
where $g_\theta$ is a neural network that learns feature representations, and $k_\text{base}$ is a standard kernel (e.g., RBF) in the learned feature space.
Variational inference enables end-to-end training: the neural network weights $\theta$, the kernel hyperparameters, the inducing locations, and the variational parameters are all optimized jointly by maximizing the ELBO, as sketched below.
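A minimal sketch of a deep kernel in PyTorch, assuming a small MLP feature extractor $g_\theta$ and an RBF base kernel (the module and parameter names are illustrative, not a library API):

```python
import torch
import torch.nn as nn

class DeepRBFKernel(nn.Module):
    """k(x, x') = k_RBF(g_theta(x), g_theta(x')) with a learnable feature map g_theta."""

    def __init__(self, input_dim, feature_dim=8):
        super().__init__()
        self.g = nn.Sequential(                  # feature extractor g_theta
            nn.Linear(input_dim, 32), nn.Tanh(),
            nn.Linear(32, feature_dim),
        )
        self.log_lengthscale = nn.Parameter(torch.tensor(0.0))
        self.log_variance = nn.Parameter(torch.tensor(0.0))

    def forward(self, X1, X2):
        h1, h2 = self.g(X1), self.g(X2)          # map inputs into the learned feature space
        ell = torch.exp(self.log_lengthscale)
        d2 = torch.cdist(h1 / ell, h2 / ell) ** 2
        return torch.exp(self.log_variance) * torch.exp(-0.5 * d2)
```

Because the feature extractor, kernel hyperparameters, inducing locations, and variational parameters would all be registered as module parameters, a single optimizer step can update them together through the ELBO.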
Automatic Differentiation:
Modern GP libraries leverage PyTorch/TensorFlow automatic differentiation: ELBO gradients with respect to variational parameters, kernel hyperparameters, inducing locations, and any neural network weights are computed automatically, so no hand-derived update equations are needed.
This unifies GP inference with the broader deep learning ecosystem.
Neural networks excel at learning features from raw data. GPs excel at principled uncertainty quantification. Combining them via deep kernels gives the best of both worlds: powerful feature learning with calibrated uncertainty—crucial for safety-critical applications like medical diagnosis or autonomous systems.
Variational inference provides a flexible, scalable framework for GP classification that integrates seamlessly with modern deep learning tools.
What's Next:
The final page of this module extends GP classification to multi-class problems. We'll see how the binary classification framework generalizes, explore the softmax likelihood, and understand the computational challenges of multi-output GPs.
You now understand variational inference for GP classification—from the ELBO derivation, through sparse approximations with inducing points, to stochastic optimization for large-scale problems. This framework powers modern GP libraries and enables integration with deep learning.