Gaussian Processes provide an elegant framework for regression, offering not just predictions but principled uncertainty quantification through closed-form posterior distributions. But what happens when our target variable is no longer continuous? What if we need to classify emails as spam or not-spam, diagnose patients as healthy or diseased, or predict whether a financial transaction is fraudulent?
The fundamental obstacle: Classification outputs are discrete—typically binary labels in {0, 1} or class probabilities in [0, 1]. The Gaussian likelihood that made GP regression so tractable is fundamentally incompatible with these outputs. We cannot assume that class labels are generated by adding Gaussian noise to a latent function. This incompatibility shatters the beautiful closed-form solutions we derived for regression and forces us into the realm of approximate inference.
By the end of this page, you will understand: (1) Why Gaussian likelihoods fail for classification, (2) How the logistic (Bernoulli) and probit likelihoods model class probabilities, (3) The concept of latent functions and the squashing function that maps them to probabilities, (4) Why non-Gaussian likelihoods break posterior conjugacy, and (5) The mathematical structure that all GP classification methods must address.
Let's first recall why GP regression works so elegantly, then identify exactly where classification breaks this framework.
GP Regression Recap:
In GP regression, we model observations as:
$$y_i = f(\mathbf{x}_i) + \epsilon_i, \quad \epsilon_i \sim \mathcal{N}(0, \sigma_n^2)$$
where $f \sim \mathcal{GP}(m, k)$ is our latent function. The key property is that:
$$p(y_i | f_i) = \mathcal{N}(y_i | f_i, \sigma_n^2)$$
This Gaussian likelihood, combined with the Gaussian process prior, yields a Gaussian posterior over the latent function. The product of two Gaussians is Gaussian—this is the conjugacy that enables closed-form inference.
The Classification Setup:
For binary classification with labels $y \in \{0, 1\}$, we need:
$$p(y_i = 1 | f_i) = \sigma(f_i)$$
where $\sigma: \mathbb{R} \to (0, 1)$ is a squashing function (also called a response function or inverse link function) that maps the unbounded latent function value to a valid probability.
The classification likelihood p(y|f) is Bernoulli, not Gaussian. The product of a Gaussian prior p(f) with a Bernoulli likelihood p(y|f) does NOT yield a Gaussian posterior. This non-conjugacy means we cannot compute the posterior p(f|y) in closed form—all GP classification methods are fundamentally approximate inference techniques.
The Latent Function Interpretation:
We can think of classification as a two-stage process:
The latent function $f$ represents an underlying 'propensity' or 'log-odds' toward the positive class. Large positive values of $f$ indicate high probability of class 1; large negative values indicate high probability of class 0.
This interpretation is powerful: we maintain the GP's modeling of correlations between inputs through the kernel, while the squashing function handles the mapping to valid probabilities.
The most common choice for binary classification is the logistic sigmoid function:
$$\sigma(f) = \frac{1}{1 + e^{-f}} = \frac{e^f}{1 + e^f}$$
This yields the Bernoulli likelihood:
$$p(y | f) = \sigma(f)^y \cdot (1 - \sigma(f))^{1-y} = \text{Bernoulli}(y | \sigma(f))$$
Or equivalently, writing it as a single expression:
$$p(y | f) = \sigma(yf) \quad \text{where } y \in \{-1, +1\}$$
(Note: The second form uses labels in $\{-1, +1\}$ rather than $\{0, 1\}$, which simplifies many derivations.)
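The equivalence of the two forms is easy to confirm numerically. Below is a minimal sketch (the latent value and helper name are arbitrary illustrative choices):

```python
import numpy as np

def sigmoid(f):
    return 1.0 / (1.0 + np.exp(-f))

f = 0.8  # an arbitrary latent value
for y01, ypm in [(0, -1), (1, +1)]:
    bernoulli_form = sigmoid(f)**y01 * (1 - sigmoid(f))**(1 - y01)  # labels in {0, 1}
    compact_form = sigmoid(ypm * f)                                 # labels in {-1, +1}
    print(y01, bernoulli_form, compact_form)  # the two forms agree
```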
| Property | Mathematical Form | Interpretation |
|---|---|---|
| Domain → Range | $\sigma: \mathbb{R} \to (0, 1)$ | Maps any real value to valid probability |
| Symmetry | $\sigma(-f) = 1 - \sigma(f)$ | Symmetric around origin; $\sigma(0) = 0.5$ |
| Derivative | $\sigma'(f) = \sigma(f)(1 - \sigma(f))$ | Maximum at $f=0$; vanishes as $|f| \to \infty$ |
| Log-odds | $f = \log\frac{p}{1-p}$ | Latent function equals log-odds ratio |
| Saturation | $\sigma(f) \to 0, 1$ as $f \to \mp\infty$ | Extreme values give confident predictions |
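As a quick sanity check, the symmetry, derivative, and log-odds identities from the table can be verified numerically; this is a sketch, not part of any library API:

```python
import numpy as np

def sigmoid(f):
    return 1.0 / (1.0 + np.exp(-f))

f = np.linspace(-6, 6, 2001)
s = sigmoid(f)

# Symmetry: sigma(-f) = 1 - sigma(f)
assert np.allclose(sigmoid(-f), 1 - s)

# Derivative: sigma'(f) = sigma(f)(1 - sigma(f)), checked by central differences
h = 1e-5
numeric_grad = (sigmoid(f + h) - sigmoid(f - h)) / (2 * h)
assert np.allclose(numeric_grad, s * (1 - s), atol=1e-8)

# Log-odds: f = log(p / (1 - p)) recovers the latent value
assert np.allclose(np.log(s / (1 - s)), f)
```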
Why Logistic Sigmoid?
The logistic function arises naturally from several perspectives:
Maximum entropy: Among all distributions over $\{0, 1\}$ with given mean, the Bernoulli-logistic model maximizes entropy.
Exponential family: The logistic function is the canonical link for the Bernoulli exponential family.
Log-odds interpretation: The latent function $f$ directly represents the log-odds $\log(p/(1-p))$, making the model interpretable.
Asymptotic behavior: Smooth saturation at extremes provides numerical stability.
The Full Likelihood:
For a dataset $\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^n$ and latent function values $\mathbf{f} = [f_1, ..., f_n]^\top$:
$$p(\mathbf{y} | \mathbf{f}) = \prod_{i=1}^n p(y_i | f_i) = \prod_{i=1}^n \sigma(f_i)^{y_i}(1-\sigma(f_i))^{1-y_i}$$
This factorizes across data points because, conditioned on the latent values $\mathbf{f}$, the observations are independent.
In practice, always use the log-likelihood: $\log p(y|f) = y \log \sigma(f) + (1-y) \log(1-\sigma(f))$. Use the numerically stable form: $\log \sigma(f) = -\log(1 + e^{-f})$ for $f > 0$ and $\log \sigma(f) = f - \log(1 + e^f)$ for $f < 0$.
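Here is a minimal NumPy sketch of this numerically stable log-likelihood (the function names are my own; `np.logaddexp(0, x)` computes $\log(1 + e^x)$ without overflow, which covers both branches of the stable form at once):

```python
import numpy as np

def log_sigmoid(f):
    # log sigma(f) = -log(1 + exp(-f)); logaddexp avoids overflow for large |f|
    return -np.logaddexp(0.0, -f)

def bernoulli_log_likelihood(y, f):
    # log p(y | f) = y log sigma(f) + (1 - y) log(1 - sigma(f)),
    # using log(1 - sigma(f)) = log sigma(-f)
    return y * log_sigmoid(f) + (1 - y) * log_sigmoid(-f)

# The full-dataset log-likelihood factorizes as a sum over points
y = np.array([1, 0, 1, 1, 0])
f = np.array([2.3, -0.7, 0.1, 35.0, -40.0])  # extreme values cause no overflow
print(bernoulli_log_likelihood(y, f).sum())
```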
An alternative to the logistic sigmoid is the probit function—the cumulative distribution function (CDF) of the standard normal distribution:
$$\Phi(f) = \int_{-\infty}^f \mathcal{N}(z | 0, 1) \, dz = \frac{1}{2}\left[1 + \text{erf}\left(\frac{f}{\sqrt{2}}\right)\right]$$
The probit likelihood is:
$$p(y | f) = \Phi(f)^y \cdot (1 - \Phi(f))^{1-y}$$
Latent Variable Interpretation:
The probit model has an elegant latent variable interpretation. Consider:
$$y = \mathbf{1}[f + \epsilon > 0], \quad \epsilon \sim \mathcal{N}(0, 1)$$
Then: $$p(y = 1 | f) = p(f + \epsilon > 0) = p(\epsilon > -f) = 1 - \Phi(-f) = \Phi(f)$$
This interpretation is particularly useful for multi-class extensions and connects to classical statistical models.
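A quick Monte Carlo sketch of this interpretation (the sample size and latent value below are arbitrary): the fraction of draws with $f + \epsilon > 0$ should approach $\Phi(f)$.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
f = 0.5                               # an arbitrary latent value
eps = rng.standard_normal(1_000_000)  # epsilon ~ N(0, 1)

mc_estimate = np.mean(f + eps > 0)    # P(f + eps > 0), estimated by sampling
print(mc_estimate, norm.cdf(f))       # both approximately Phi(0.5) ≈ 0.691
```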
Comparing Logistic and Probit:
Both functions are monotonically increasing, symmetric around 0, and map $\mathbb{R} \to (0, 1)$. They are nearly indistinguishable for moderate values of $f$: with the standard rescaling constant, $\sigma(f) \approx \Phi(f / 1.70)$, or equivalently $\Phi(f) \approx \sigma(1.70\, f)$.
The key difference is in the tails: the logistic function has heavier tails (slower decay to 0 and 1), making it slightly more robust to outliers. The probit function's lighter tails mean that extreme observations have more influence on the latent function.
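The closeness of the two curves is easy to check numerically. This sketch assumes the standard rescaling constant 1.702:

```python
import numpy as np
from scipy.stats import norm

def sigmoid(f):
    return 1.0 / (1.0 + np.exp(-f))

f = np.linspace(-6, 6, 1001)
diff = np.abs(sigmoid(f) - norm.cdf(f / 1.702))  # rescaled probit vs logistic
print(f"max |sigmoid(f) - Phi(f/1.702)| = {diff.max():.4f}")  # roughly 0.01
```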
Practical Guidance: In practice the logistic sigmoid is the usual default. The probit is attractive when its Gaussian-integral identities yield closed-form expressions (as in the prediction formula later on this page), while the logistic's heavier tails make it slightly more forgiving of mislabeled or outlying points.
Now we arrive at the central challenge of GP classification. By Bayes' theorem, the posterior over latent function values is:
$$p(\mathbf{f} | \mathbf{y}, X) = \frac{p(\mathbf{y} | \mathbf{f}) \cdot p(\mathbf{f} | X)}{p(\mathbf{y} | X)}$$
where $p(\mathbf{y} | \mathbf{f}) = \prod_i p(y_i | f_i)$ is the Bernoulli likelihood, $p(\mathbf{f} | X) = \mathcal{N}(\mathbf{f} | \mathbf{m}, K)$ is the GP prior over the latent values at the training inputs, and $p(\mathbf{y} | X)$ is the marginal likelihood (the normalizing constant).
The posterior p(f|y,X) is the product of a Gaussian (prior) and a product of sigmoids (likelihood). This product is NOT Gaussian. The normalizing constant (marginal likelihood) requires integrating this product over all f ∈ ℝⁿ—an integral with no closed-form solution. Every GP classification method addresses this intractability through approximation.
Visualizing the Problem:
For a single latent variable $f$, consider:
$$p(f | y, x) \propto \underbrace{\sigma(f)^y (1-\sigma(f))^{1-y}}_{\text{Bernoulli likelihood}} \cdot \underbrace{\mathcal{N}(f | m, k)}_{\text{GP prior}}$$
The prior is a Gaussian (bell curve). The likelihood is a sigmoid (S-curve) or its complement. Their product is skewed and asymmetric—manifestly non-Gaussian.
For $n$ data points, we face an $n$-dimensional integral. Even for modest $n$, direct numerical integration becomes computationally infeasible.
The Two Intractable Computations:
1. The posterior $p(\mathbf{f} | \mathbf{y}, X)$, needed for making predictions.
2. The marginal likelihood $p(\mathbf{y} | X) = \int p(\mathbf{y} | \mathbf{f})\, p(\mathbf{f} | X)\, d\mathbf{f}$, needed for hyperparameter selection.
Both require integrating out the latent function values, which cannot be done analytically.
```python
import numpy as np
from scipy.stats import norm

def sigmoid(f):
    return 1 / (1 + np.exp(-f))

def probit(f):
    return norm.cdf(f)

# Single-dimension illustration
f = np.linspace(-5, 5, 1000)

# Prior: N(0, 1)
prior = norm.pdf(f, 0, 1)

# Likelihood for y=1 (class 1)
likelihood_logistic = sigmoid(f)
likelihood_probit = probit(f)

# Unnormalized posterior (y=1)
posterior_logistic = likelihood_logistic * prior
posterior_probit = likelihood_probit * prior

# The posterior is clearly non-Gaussian: skewed and asymmetric.
# To get the true posterior, we'd need to normalize (integrate),
# but the result would still be non-Gaussian in shape.

# Key insight: The posterior shifts toward positive f values
# because y=1 is more likely when f > 0
print("Posterior mode is shifted from prior mean (0) toward positive f")
print(f"Logistic posterior mode: ~{f[np.argmax(posterior_logistic)]:.2f}")
print(f"Probit posterior mode: ~{f[np.argmax(posterior_probit)]:.2f}")
```

Even if we could compute the posterior $p(\mathbf{f} | \mathbf{y}, X)$, making predictions at a new test point $\mathbf{x}_*$ involves additional integration.
Step 1: Latent Predictive Distribution
First, we need the distribution of the latent function at the test point:
$$p(f_* | \mathbf{y}, X, \mathbf{x}_*) = \int p(f_* | \mathbf{f}, X, \mathbf{x}_*) \cdot p(\mathbf{f} | \mathbf{y}, X) \, d\mathbf{f}$$
The first term is the GP predictive given training latents—this is Gaussian: $$p(f_* | \mathbf{f}, X, \mathbf{x}_*) = \mathcal{N}(f_* | \mu_*, \sigma_*^2)$$
with: $$\mu_* = m(\mathbf{x}_*) + \mathbf{k}_*^\top K^{-1}(\mathbf{f} - \mathbf{m})$$ $$\sigma_*^2 = k(\mathbf{x}_*, \mathbf{x}_*) - \mathbf{k}_*^\top K^{-1} \mathbf{k}_*$$
But averaging this Gaussian over the non-Gaussian posterior $p(\mathbf{f} | \mathbf{y}, X)$ has no closed form.
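Here is a sketch of the Gaussian conditional above, assuming a zero mean function and an RBF kernel of my own choosing; `f_train` stands in for known latent values, which in practice we never observe:

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0, variance=1.0):
    # Squared-exponential kernel k(a, b) = variance * exp(-|a - b|^2 / (2 l^2))
    sqdist = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return variance * np.exp(-0.5 * sqdist / lengthscale**2)

def latent_predictive(X, f_train, x_star, jitter=1e-8):
    # p(f_* | f, X, x_*) = N(mu_*, sigma_*^2), with zero prior mean
    K = rbf_kernel(X, X) + jitter * np.eye(len(X))
    k_star = rbf_kernel(X, x_star)            # shape (n, 1)
    alpha = np.linalg.solve(K, f_train)
    mu_star = k_star.T @ alpha                # k_*^T K^{-1} f
    v = np.linalg.solve(K, k_star)
    var_star = rbf_kernel(x_star, x_star) - k_star.T @ v
    return mu_star.item(), var_star.item()

X = np.array([[-1.0], [0.0], [1.5]])
f_train = np.array([-0.8, 0.3, 1.2])          # hypothetical latent values
print(latent_predictive(X, f_train, np.array([[0.5]])))
```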
Step 2: Class Probability Prediction
Once we have (or approximate) the latent predictive distribution $p(f_* | \mathbf{y}, X, \mathbf{x}_*)$, the class probability is:
$$p(y_* = 1 | \mathbf{y}, X, \mathbf{x}_*) = \int \sigma(f_*) \, p(f_* | \mathbf{y}, X, \mathbf{x}_*) \, df_*$$
This is a one-dimensional integral, which is more tractable—but still requires knowledge of the latent predictive distribution.
Special Case: Probit Likelihood with Gaussian Approximation
If we approximate $p(f_* | \mathbf{y}, X, \mathbf{x}_*) \approx \mathcal{N}(\mu_*, \sigma_*^2)$, and use the probit likelihood:
$$\int \Phi(f_*) \, \mathcal{N}(f_* | \mu_*, \sigma_*^2) \, df_* = \Phi\left(\frac{\mu_*}{\sqrt{1 + \sigma_*^2}}\right)$$
This closed-form solution for probit is one reason it's sometimes preferred for theoretical analysis. For logistic, this integral must be approximated numerically.
Notice the denominator $\sqrt{1 + \sigma_*^2}$ in the probit prediction formula. Higher uncertainty (larger $\sigma_*^2$) pushes the prediction toward 0.5. This is exactly what we want: uncertain predictions should be less confident. This automatic calibration is a key advantage of the full Bayesian treatment in GP classification.
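The probit identity above is easy to verify with one-dimensional quadrature; the mean and variance in this sketch are arbitrary:

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

mu_star, var_star = 0.9, 2.5  # an arbitrary latent predictive N(mu_*, sigma_*^2)

# Left-hand side: integrate Phi(f) * N(f | mu_*, sigma_*^2) numerically
lhs, _ = quad(lambda f: norm.cdf(f) * norm.pdf(f, mu_star, np.sqrt(var_star)),
              -20, 20)

# Right-hand side: closed form Phi(mu_* / sqrt(1 + sigma_*^2))
rhs = norm.cdf(mu_star / np.sqrt(1 + var_star))

print(lhs, rhs)  # agree to numerical precision
```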
Since exact inference is impossible, the GP classification literature has developed several approximation strategies. Each makes different trade-offs between accuracy, computational cost, and implementation complexity.
The Core Approaches:
All methods approximate the intractable posterior $p(\mathbf{f} | \mathbf{y}, X)$ with a tractable distribution $q(\mathbf{f})$. The differences lie in how $q$ is chosen and optimized.
| Method | Approximation Type | Key Idea | Covered In |
|---|---|---|---|
| Laplace Approximation | Gaussian at MAP | Fit Gaussian at posterior mode using second-order Taylor expansion | Page 2 |
| Expectation Propagation | Moment matching | Iteratively refine Gaussian approximation to match marginal moments | Page 3 |
| Variational Inference | ELBO optimization | Find best Gaussian approximation by maximizing evidence lower bound | Page 4 |
| MCMC Sampling | Monte Carlo | Draw samples from true posterior; most accurate but expensive | Advanced topic |
Why Gaussian Approximations?
Almost all practical GP classification methods approximate the posterior with a Gaussian:
$$q(\mathbf{f}) = \mathcal{N}(\mathbf{f} | \boldsymbol{\mu}, \Sigma)$$
This choice is not arbitrary:
Computational tractability: Gaussian distributions are closed under marginalization, conditioning, and linear transformation. This makes predictions tractable.
Natural extension of regression: The true posterior in GP regression is Gaussian. Using Gaussian approximations for classification maintains algorithmic similarity.
Parameterization: A Gaussian over $n$ training points requires $O(n)$ mean parameters and $O(n^2)$ covariance parameters—expensive, but feasible.
Uncertainty propagation: Gaussian approximations naturally propagate uncertainty through the prediction pipeline.
The Question of Quality:
The different methods (Laplace, EP, VI) correspond to different ways of fitting the parameters $(\boldsymbol{\mu}, \Sigma)$. They differ in: how the mean and covariance are chosen (posterior mode and curvature, matched marginal moments, or an optimized evidence lower bound), their computational cost, and how faithfully the resulting Gaussian captures the skewed true posterior.
To unify our treatment of GP classification methods, we establish the common mathematical framework that all approaches share.
The Model:
$$\mathbf{f} | X \sim \mathcal{N}(\mathbf{m}, K) \quad \text{(GP prior)}$$ $$y_i | f_i \stackrel{\text{ind}}{\sim} \text{Bernoulli}(\sigma(f_i)) \quad \text{(conditional independence)}$$
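The generative model can be simulated directly, which is a useful sanity check. This is a sketch with a zero mean function and an RBF kernel of my own choosing:

```python
import numpy as np

rng = np.random.default_rng(1)

def rbf_kernel(X, lengthscale=1.0):
    sqdist = (X[:, None] - X[None, :])**2
    return np.exp(-0.5 * sqdist / lengthscale**2)

def sigmoid(f):
    return 1.0 / (1.0 + np.exp(-f))

# f | X ~ N(0, K): draw a latent function from the GP prior
X = np.linspace(-3, 3, 50)
K = rbf_kernel(X) + 1e-8 * np.eye(len(X))
f = rng.multivariate_normal(np.zeros(len(X)), K)

# y_i | f_i ~ Bernoulli(sigma(f_i)): squash and sample labels independently
y = rng.binomial(1, sigmoid(f))
print(y[:10])
```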
The Inference Goals:
Approximate the posterior: Find $q(\mathbf{f}) \approx p(\mathbf{f} | \mathbf{y}, X)$
Approximate the marginal likelihood: Estimate $p(\mathbf{y} | X) = \int p(\mathbf{y} | \mathbf{f}) p(\mathbf{f} | X) d\mathbf{f}$ for hyperparameter optimization
Make predictions: Compute $p(y_* = 1 | \mathbf{y}, X, \mathbf{x}_*)$ for new inputs
Key Quantities:
Define the log-likelihood and its derivatives:
$$\psi_i(f_i) = \log p(y_i | f_i)$$ $$\nabla \psi_i = \frac{\partial \psi_i}{\partial f_i}$$ $$\nabla^2 \psi_i = \frac{\partial^2 \psi_i}{\partial f_i^2}$$
For the Logistic Likelihood:
$$\psi_i = y_i \log \sigma(f_i) + (1-y_i) \log(1-\sigma(f_i))$$
Using the identity $\sigma'(f) = \sigma(f)(1-\sigma(f))$:
$$\nabla \psi_i = y_i - \sigma(f_i) = y_i - \pi_i$$
$$\nabla^2 \psi_i = -\sigma(f_i)(1-\sigma(f_i)) = -\pi_i(1-\pi_i)$$
where $\pi_i = \sigma(f_i)$ is the predicted probability.
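These derivatives are straightforward to check against finite differences; a minimal sketch with arbitrary test values:

```python
import numpy as np

def sigmoid(f):
    return 1.0 / (1.0 + np.exp(-f))

def log_lik(y, f):
    # psi(f) = y log sigma(f) + (1 - y) log(1 - sigma(f)), stable form
    return y * -np.logaddexp(0.0, -f) + (1 - y) * -np.logaddexp(0.0, f)

y, f, h = 1, 0.4, 1e-5
pi = sigmoid(f)

grad_analytic = y - pi          # first derivative: y - pi
hess_analytic = -pi * (1 - pi)  # second derivative: -pi(1 - pi), always negative

grad_numeric = (log_lik(y, f + h) - log_lik(y, f - h)) / (2 * h)
hess_numeric = (log_lik(y, f + h) - 2 * log_lik(y, f) + log_lik(y, f - h)) / h**2

print(grad_analytic, grad_numeric)
print(hess_analytic, hess_numeric)
```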
Key Observations:
The gradient $\nabla \psi_i = y_i - \pi_i$ is simply the residual between the observed label and the predicted probability. The second derivative $\nabla^2 \psi_i = -\pi_i(1 - \pi_i)$ is strictly negative, so each log-likelihood term is concave in $f_i$. Together with the log of the Gaussian prior (a concave quadratic), this ensures the log-posterior is concave, which means the Laplace approximation finds a unique global maximum.
The concavity of the log-posterior guarantees that gradient-based optimization will find the unique global mode. This is NOT guaranteed for all likelihoods—some robust classification likelihoods (e.g., t-distribution based) may have multiple modes, complicating inference.
We have established the fundamental challenge that defines all of GP classification: the incompatibility between Gaussian process priors and classification likelihoods.
What's Next:
With the problem clearly defined, we now turn to solutions. The next page introduces the Laplace approximation—the simplest and most widely-used method for GP classification. We'll derive the algorithm, understand its geometric interpretation, and learn when it works well and when more sophisticated methods are needed.
You now understand why GP classification requires approximate inference: the non-Gaussian Bernoulli likelihood breaks the conjugacy that made GP regression tractable. This challenge—and the Gaussian approximations that address it—will be the theme throughout this module.