Gaussian Processes provide an elegant framework for regression, offering not just predictions but principled uncertainty quantification through closed-form posterior distributions. But what happens when our target variable is no longer continuous? What if we need to classify emails as spam or not-spam, diagnose patients as healthy or diseased, or predict whether a financial transaction is fraudulent?
The fundamental obstacle: Classification outputs are discrete—typically binary labels in {0, 1} or class probabilities in [0, 1]. The Gaussian likelihood that made GP regression so tractable is fundamentally incompatible with these outputs. We cannot assume that class labels are generated by adding Gaussian noise to a latent function. This incompatibility shatters the beautiful closed-form solutions we derived for regression and forces us into the realm of approximate inference.
By the end of this page, you will understand: (1) Why Gaussian likelihoods fail for classification, (2) How the logistic (Bernoulli) and probit likelihoods model class probabilities, (3) The concept of latent functions and the squashing function that maps them to probabilities, (4) Why non-Gaussian likelihoods break posterior conjugacy, and (5) The mathematical structure that all GP classification methods must address.
Let's first recall why GP regression works so elegantly, then identify exactly where classification breaks this framework.
GP Regression Recap:
In GP regression, we model observations as:
$$y_i = f(\mathbf{x}_i) + \epsilon_i, \quad \epsilon_i \sim \mathcal{N}(0, \sigma_n^2)$$
where $f \sim \mathcal{GP}(m, k)$ is our latent function. The key property is that:
$$p(y_i | f_i) = \mathcal{N}(y_i | f_i, \sigma_n^2)$$
This Gaussian likelihood, combined with the Gaussian process prior, yields a Gaussian posterior over the latent function. The product of two Gaussians is Gaussian—this is the conjugacy that enables closed-form inference.
The Classification Setup:
For binary classification with labels $y \in \{0, 1\}$, we need:
$$p(y_i = 1 | f_i) = \sigma(f_i)$$
where $\sigma: \mathbb{R} \to (0, 1)$ is a squashing function (also called a response function or inverse link function) that maps the unbounded latent function value to a valid probability.
The classification likelihood p(y|f) is Bernoulli, not Gaussian. The product of a Gaussian prior p(f) with a Bernoulli likelihood p(y|f) does NOT yield a Gaussian posterior. This non-conjugacy means we cannot compute the posterior p(f|y) in closed form—all GP classification methods are fundamentally approximate inference techniques.
The Latent Function Interpretation:
We can think of classification as a two-stage process:
The latent function $f$ represents an underlying 'propensity' or 'log-odds' toward the positive class. Large positive values of $f$ indicate high probability of class 1; large negative values indicate high probability of class 0.
This interpretation is powerful: we maintain the GP's modeling of correlations between inputs through the kernel, while the squashing function handles the mapping to valid probabilities.
The most common choice for binary classification is the logistic sigmoid function:
$$\sigma(f) = \frac{1}{1 + e^{-f}} = \frac{e^f}{1 + e^f}$$
This yields the Bernoulli likelihood:
$$p(y | f) = \sigma(f)^y \cdot (1 - \sigma(f))^{1-y} = \text{Bernoulli}(y | \sigma(f))$$
Or equivalently, writing it as a single expression:
$$p(y | f) = \sigma(yf) \quad \text{where } y \in \{-1, +1\}$$
(Note: The second form uses labels in $\{-1, +1\}$ rather than $\{0, 1\}$, which simplifies many derivations.)
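The equivalence of the two forms is easy to confirm numerically. Below is a minimal sketch (the latent value and helper name are arbitrary illustrative choices):

```python
import numpy as np

def sigmoid(f):
    return 1.0 / (1.0 + np.exp(-f))

f = 0.8  # an arbitrary latent value
for y01, ypm in [(0, -1), (1, +1)]:
    bernoulli_form = sigmoid(f)**y01 * (1 - sigmoid(f))**(1 - y01)  # labels in {0, 1}
    compact_form = sigmoid(ypm * f)                                 # labels in {-1, +1}
    print(y01, bernoulli_form, compact_form)  # the two forms agree
```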
| Property | Mathematical Form | Interpretation |
|---|---|---|
| Domain → Range | $\sigma: \mathbb{R} \to (0, 1)$ | Maps any real value to valid probability |
| Symmetry | $\sigma(-f) = 1 - \sigma(f)$ | Symmetric around origin; $\sigma(0) = 0.5$ |
| Derivative | $\sigma'(f) = \sigma(f)(1 - \sigma(f))$ | Maximum at $f=0$; vanishes as $|f| \to \infty$ |
| Log-odds | $f = \log\frac{p}{1-p}$ | Latent function equals log-odds ratio |
| Saturation | $\sigma(f) \to 0, 1$ as $f \to \mp\infty$ | Extreme values give confident predictions |
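As a quick sanity check, the symmetry, derivative, and log-odds identities from the table can be verified numerically; this is a sketch, not part of any library API:

```python
import numpy as np

def sigmoid(f):
    return 1.0 / (1.0 + np.exp(-f))

f = np.linspace(-6, 6, 2001)
s = sigmoid(f)

# Symmetry: sigma(-f) = 1 - sigma(f)
assert np.allclose(sigmoid(-f), 1 - s)

# Derivative: sigma'(f) = sigma(f)(1 - sigma(f)), checked by central differences
h = 1e-5
numeric_grad = (sigmoid(f + h) - sigmoid(f - h)) / (2 * h)
assert np.allclose(numeric_grad, s * (1 - s), atol=1e-8)

# Log-odds: f = log(p / (1 - p)) recovers the latent value
assert np.allclose(np.log(s / (1 - s)), f)
```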
Why Logistic Sigmoid?
The logistic function arises naturally from several perspectives:
Maximum entropy: Among all distributions over $\{0, 1\}$ with given mean, the Bernoulli-logistic model maximizes entropy.
Exponential family: The logistic function is the canonical link for the Bernoulli exponential family.
Log-odds interpretation: The latent function $f$ directly represents the log-odds $\log(p/(1-p))$, making the model interpretable.
Asymptotic behavior: Smooth saturation at extremes provides numerical stability.
The Full Likelihood:
For a dataset $\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^n$ and latent function values $\mathbf{f} = [f_1, ..., f_n]^\top$:
$$p(\mathbf{y} | \mathbf{f}) = \prod_{i=1}^n p(y_i | f_i) = \prod_{i=1}^n \sigma(f_i)^{y_i}(1-\sigma(f_i))^{1-y_i}$$
This factorizes across data points because, conditioned on the latent values $\mathbf{f}$, the observations are independent.
In practice, always use the log-likelihood: $\log p(y|f) = y \log \sigma(f) + (1-y) \log(1-\sigma(f))$. Use the numerically stable form: $\log \sigma(f) = -\log(1 + e^{-f})$ for $f > 0$ and $\log \sigma(f) = f - \log(1 + e^f)$ for $f < 0$.
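Here is a minimal NumPy sketch of this numerically stable log-likelihood (the function names are my own; `np.logaddexp(0, x)` computes $\log(1 + e^x)$ without overflow, which covers both branches of the stable form at once):

```python
import numpy as np

def log_sigmoid(f):
    # log sigma(f) = -log(1 + exp(-f)); logaddexp avoids overflow for large |f|
    return -np.logaddexp(0.0, -f)

def bernoulli_log_likelihood(y, f):
    # log p(y | f) = y log sigma(f) + (1 - y) log(1 - sigma(f)),
    # using log(1 - sigma(f)) = log sigma(-f)
    return y * log_sigmoid(f) + (1 - y) * log_sigmoid(-f)

# The full-dataset log-likelihood factorizes as a sum over points
y = np.array([1, 0, 1, 1, 0])
f = np.array([2.3, -0.7, 0.1, 35.0, -40.0])  # extreme values cause no overflow
print(bernoulli_log_likelihood(y, f).sum())
```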
An alternative to the logistic sigmoid is the probit function—the cumulative distribution function (CDF) of the standard normal distribution:
$$\Phi(f) = \int_{-\infty}^f \mathcal{N}(z | 0, 1) \, dz = \frac{1}{2}\left[1 + \text{erf}\left(\frac{f}{\sqrt{2}}\right)\right]$$
The probit likelihood is:
$$p(y | f) = \Phi(f)^y \cdot (1 - \Phi(f))^{1-y}$$
Latent Variable Interpretation:
The probit model has an elegant latent variable interpretation. Consider:
$$y = \mathbf{1}[f + \epsilon > 0], \quad \epsilon \sim \mathcal{N}(0, 1)$$
Then: $$p(y = 1 | f) = p(f + \epsilon > 0) = p(\epsilon > -f) = 1 - \Phi(-f) = \Phi(f)$$
This interpretation is particularly useful for multi-class extensions and connects to classical statistical models.
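A quick Monte Carlo sketch of this interpretation (the sample size and latent value below are arbitrary): the fraction of draws with $f + \epsilon > 0$ should approach $\Phi(f)$.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
f = 0.5                               # an arbitrary latent value
eps = rng.standard_normal(1_000_000)  # epsilon ~ N(0, 1)

mc_estimate = np.mean(f + eps > 0)    # P(f + eps > 0), estimated by sampling
print(mc_estimate, norm.cdf(f))       # both approximately Phi(0.5) ≈ 0.691
```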
Comparing Logistic and Probit:
Both functions are monotonically increasing, symmetric around 0, and map $\mathbb{R} \to (0, 1)$. They are nearly indistinguishable for moderate values of $f$: with the standard rescaling constant, $\sigma(f) \approx \Phi(f / 1.70)$, or equivalently $\Phi(f) \approx \sigma(1.70\, f)$.
The key difference is in the tails: the logistic function has heavier tails (slower decay to 0 and 1), making it slightly more robust to outliers. The probit function's lighter tails mean that extreme observations have more influence on the latent function.
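The closeness of the two curves is easy to check numerically. This sketch assumes the standard rescaling constant 1.702:

```python
import numpy as np
from scipy.stats import norm

def sigmoid(f):
    return 1.0 / (1.0 + np.exp(-f))

f = np.linspace(-6, 6, 1001)
diff = np.abs(sigmoid(f) - norm.cdf(f / 1.702))  # rescaled probit vs logistic
print(f"max |sigmoid(f) - Phi(f/1.702)| = {diff.max():.4f}")  # roughly 0.01
```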
Practical Guidance: In practice the logistic sigmoid is the usual default. The probit is attractive when its Gaussian-integral identities yield closed-form expressions (as in the prediction formula later on this page), while the logistic's heavier tails make it slightly more forgiving of mislabeled or outlying points.
Now we arrive at the central challenge of GP classification. By Bayes' theorem, the posterior over latent function values is:
$$p(\mathbf{f} | \mathbf{y}, X) = \frac{p(\mathbf{y} | \mathbf{f}) \cdot p(\mathbf{f} | X)}{p(\mathbf{y} | X)}$$
where $p(\mathbf{y} | \mathbf{f}) = \prod_i p(y_i | f_i)$ is the Bernoulli likelihood, $p(\mathbf{f} | X) = \mathcal{N}(\mathbf{f} | \mathbf{m}, K)$ is the GP prior over the latent values at the training inputs, and $p(\mathbf{y} | X)$ is the marginal likelihood (the normalizing constant).
The posterior p(f|y,X) is the product of a Gaussian (prior) and a product of sigmoids (likelihood). This product is NOT Gaussian. The normalizing constant (marginal likelihood) requires integrating this product over all f ∈ ℝⁿ—an integral with no closed-form solution. Every GP classification method addresses this intractability through approximation.
Visualizing the Problem:
For a single latent variable $f$, consider:
$$p(f | y, x) \propto \underbrace{\sigma(f)^y (1-\sigma(f))^{1-y}}_{\text{Bernoulli likelihood}} \cdot \underbrace{\mathcal{N}(f | m, k)}_{\text{GP prior}}$$
The prior is a Gaussian (bell curve). The likelihood is a sigmoid (S-curve) or its complement. Their product is skewed and asymmetric—manifestly non-Gaussian.
For $n$ data points, we face an $n$-dimensional integral. Even for modest $n$, direct numerical integration becomes computationally infeasible.
The Two Intractable Computations:
1. The posterior $p(\mathbf{f} | \mathbf{y}, X)$, needed for making predictions.
2. The marginal likelihood $p(\mathbf{y} | X) = \int p(\mathbf{y} | \mathbf{f})\, p(\mathbf{f} | X)\, d\mathbf{f}$, needed for hyperparameter selection.
Both require integrating out the latent function values, which cannot be done analytically.
```python
import numpy as np
from scipy.stats import norm

def sigmoid(f):
    return 1 / (1 + np.exp(-f))

def probit(f):
    return norm.cdf(f)

# Single-dimension illustration
f = np.linspace(-5, 5, 1000)

# Prior: N(0, 1)
prior = norm.pdf(f, 0, 1)

# Likelihood for y=1 (class 1)
likelihood_logistic = sigmoid(f)
likelihood_probit = probit(f)

# Unnormalized posterior (y=1)
posterior_logistic = likelihood_logistic * prior
posterior_probit = likelihood_probit * prior

# The posterior is clearly non-Gaussian: skewed and asymmetric.
# To get the true posterior, we'd need to normalize (integrate),
# but the result would still be non-Gaussian in shape.

# Key insight: The posterior shifts toward positive f values
# because y=1 is more likely when f > 0
print("Posterior mode is shifted from prior mean (0) toward positive f")
print(f"Logistic posterior mode: ~{f[np.argmax(posterior_logistic)]:.2f}")
print(f"Probit posterior mode: ~{f[np.argmax(posterior_probit)]:.2f}")
```

Even if we could compute the posterior $p(\mathbf{f} | \mathbf{y}, X)$, making predictions at a new test point $\mathbf{x}_*$ involves additional integration.
Step 1: Latent Predictive Distribution
First, we need the distribution of the latent function at the test point:
$$p(f_* | \mathbf{y}, X, \mathbf{x}_*) = \int p(f_* | \mathbf{f}, X, \mathbf{x}_*) \cdot p(\mathbf{f} | \mathbf{y}, X) \, d\mathbf{f}$$
The first term is the GP predictive given training latents—this is Gaussian: $$p(f_* | \mathbf{f}, X, \mathbf{x}_*) = \mathcal{N}(f_* | \mu_*, \sigma_*^2)$$
with: $$\mu_* = m(\mathbf{x}_*) + \mathbf{k}_*^\top K^{-1}(\mathbf{f} - \mathbf{m})$$ $$\sigma_*^2 = k(\mathbf{x}_*, \mathbf{x}_*) - \mathbf{k}_*^\top K^{-1} \mathbf{k}_*$$
But averaging this Gaussian over the non-Gaussian posterior $p(\mathbf{f} | \mathbf{y}, X)$ has no closed form.
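Here is a sketch of the Gaussian conditional above, assuming a zero mean function and an RBF kernel of my own choosing; `f_train` stands in for known latent values, which in practice we never observe:

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0, variance=1.0):
    # Squared-exponential kernel k(a, b) = variance * exp(-|a - b|^2 / (2 l^2))
    sqdist = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return variance * np.exp(-0.5 * sqdist / lengthscale**2)

def latent_predictive(X, f_train, x_star, jitter=1e-8):
    # p(f_* | f, X, x_*) = N(mu_*, sigma_*^2), with zero prior mean
    K = rbf_kernel(X, X) + jitter * np.eye(len(X))
    k_star = rbf_kernel(X, x_star)            # shape (n, 1)
    alpha = np.linalg.solve(K, f_train)
    mu_star = k_star.T @ alpha                # k_*^T K^{-1} f
    v = np.linalg.solve(K, k_star)
    var_star = rbf_kernel(x_star, x_star) - k_star.T @ v
    return mu_star.item(), var_star.item()

X = np.array([[-1.0], [0.0], [1.5]])
f_train = np.array([-0.8, 0.3, 1.2])          # hypothetical latent values
print(latent_predictive(X, f_train, np.array([[0.5]])))
```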
Step 2: Class Probability Prediction
Once we have (or approximate) the latent predictive distribution $p(f_* | \mathbf{y}, X, \mathbf{x}_*)$, the class probability is:
$$p(y_* = 1 | \mathbf{y}, X, \mathbf{x}_*) = \int \sigma(f_*) \, p(f_* | \mathbf{y}, X, \mathbf{x}_*) \, df_*$$
This is a one-dimensional integral, which is more tractable—but still requires knowledge of the latent predictive distribution.
Special Case: Probit Likelihood with Gaussian Approximation
If we approximate $p(f_* | \mathbf{y}, X, \mathbf{x}_*) \approx \mathcal{N}(\mu_*, \sigma_*^2)$, and use the probit likelihood:
$$\int \Phi(f_*) \, \mathcal{N}(f_* | \mu_*, \sigma_*^2) \, df_* = \Phi\left(\frac{\mu_*}{\sqrt{1 + \sigma_*^2}}\right)$$
This closed-form solution for probit is one reason it's sometimes preferred for theoretical analysis. For logistic, this integral must be approximated numerically.
Notice the denominator $\sqrt{1 + \sigma_*^2}$ in the probit prediction formula. Higher uncertainty (larger $\sigma_*^2$) pushes the prediction toward 0.5. This is exactly what we want: uncertain predictions should be less confident. This automatic calibration is a key advantage of the full Bayesian treatment in GP classification.
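The probit identity above is easy to verify with one-dimensional quadrature; the mean and variance in this sketch are arbitrary:

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

mu_star, var_star = 0.9, 2.5  # an arbitrary latent predictive N(mu_*, sigma_*^2)

# Left-hand side: integrate Phi(f) * N(f | mu_*, sigma_*^2) numerically
lhs, _ = quad(lambda f: norm.cdf(f) * norm.pdf(f, mu_star, np.sqrt(var_star)),
              -20, 20)

# Right-hand side: closed form Phi(mu_* / sqrt(1 + sigma_*^2))
rhs = norm.cdf(mu_star / np.sqrt(1 + var_star))

print(lhs, rhs)  # agree to numerical precision
```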
Since exact inference is impossible, the GP classification literature has developed several approximation strategies. Each makes different trade-offs between accuracy, computational cost, and implementation complexity.
The Core Approaches:
All methods approximate the intractable posterior $p(\mathbf{f} | \mathbf{y}, X)$ with a tractable distribution $q(\mathbf{f})$. The differences lie in how $q$ is chosen and optimized.
| Method | Approximation Type | Key Idea | Covered In |
|---|---|---|---|
| Laplace Approximation | Gaussian at MAP | Fit Gaussian at posterior mode using second-order Taylor expansion | Page 2 |
| Expectation Propagation | Moment matching | Iteratively refine Gaussian approximation to match marginal moments | Page 3 |
| Variational Inference | ELBO optimization | Find best Gaussian approximation by maximizing evidence lower bound | Page 4 |
| MCMC Sampling | Monte Carlo | Draw samples from true posterior; most accurate but expensive | Advanced topic |
Why Gaussian Approximations?
Almost all practical GP classification methods approximate the posterior with a Gaussian:
$$q(\mathbf{f}) = \mathcal{N}(\mathbf{f} | \boldsymbol{\mu}, \Sigma)$$
This choice is not arbitrary:
Computational tractability: Gaussian distributions are closed under marginalization, conditioning, and linear transformation. This makes predictions tractable.
Natural extension of regression: The true posterior in GP regression is Gaussian. Using Gaussian approximations for classification maintains algorithmic similarity.
Parameterization: A Gaussian over $n$ training points requires $O(n)$ mean parameters and $O(n^2)$ covariance parameters—expensive, but feasible.
Uncertainty propagation: Gaussian approximations naturally propagate uncertainty through the prediction pipeline.
The Question of Quality:
The different methods (Laplace, EP, VI) correspond to different ways of fitting the parameters $(\boldsymbol{\mu}, \Sigma)$. They differ in: how the mean and covariance are chosen (posterior mode and curvature, matched marginal moments, or an optimized evidence lower bound), their computational cost, and how faithfully the resulting Gaussian captures the skewed true posterior.
To unify our treatment of GP classification methods, we establish the common mathematical framework that all approaches share.
The Model:
$$\mathbf{f} | X \sim \mathcal{N}(\mathbf{m}, K) \quad \text{(GP prior)}$$ $$y_i | f_i \stackrel{\text{ind}}{\sim} \text{Bernoulli}(\sigma(f_i)) \quad \text{(conditional independence)}$$
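The generative model can be simulated directly, which is a useful sanity check. This is a sketch with a zero mean function and an RBF kernel of my own choosing:

```python
import numpy as np

rng = np.random.default_rng(1)

def rbf_kernel(X, lengthscale=1.0):
    sqdist = (X[:, None] - X[None, :])**2
    return np.exp(-0.5 * sqdist / lengthscale**2)

def sigmoid(f):
    return 1.0 / (1.0 + np.exp(-f))

# f | X ~ N(0, K): draw a latent function from the GP prior
X = np.linspace(-3, 3, 50)
K = rbf_kernel(X) + 1e-8 * np.eye(len(X))
f = rng.multivariate_normal(np.zeros(len(X)), K)

# y_i | f_i ~ Bernoulli(sigma(f_i)): squash and sample labels independently
y = rng.binomial(1, sigmoid(f))
print(y[:10])
```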
The Inference Goals:
Approximate the posterior: Find $q(\mathbf{f}) \approx p(\mathbf{f} | \mathbf{y}, X)$
Approximate the marginal likelihood: Estimate $p(\mathbf{y} | X) = \int p(\mathbf{y} | \mathbf{f}) p(\mathbf{f} | X) d\mathbf{f}$ for hyperparameter optimization
Make predictions: Compute $p(y_* = 1 | \mathbf{y}, X, \mathbf{x}_*)$ for new inputs
Key Quantities:
Define the log-likelihood and its derivatives:
$$\psi_i(f_i) = \log p(y_i | f_i)$$ $$\nabla \psi_i = \frac{\partial \psi_i}{\partial f_i}$$ $$\nabla^2 \psi_i = \frac{\partial^2 \psi_i}{\partial f_i^2}$$
For the Logistic Likelihood:
$$\psi_i = y_i \log \sigma(f_i) + (1-y_i) \log(1-\sigma(f_i))$$
Using the identity $\sigma'(f) = \sigma(f)(1-\sigma(f))$:
$$\nabla \psi_i = y_i - \sigma(f_i) = y_i - \pi_i$$
$$\nabla^2 \psi_i = -\sigma(f_i)(1-\sigma(f_i)) = -\pi_i(1-\pi_i)$$
where $\pi_i = \sigma(f_i)$ is the predicted probability.
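These derivatives are straightforward to check against finite differences; a minimal sketch with arbitrary test values:

```python
import numpy as np

def sigmoid(f):
    return 1.0 / (1.0 + np.exp(-f))

def log_lik(y, f):
    # psi(f) = y log sigma(f) + (1 - y) log(1 - sigma(f)), stable form
    return y * -np.logaddexp(0.0, -f) + (1 - y) * -np.logaddexp(0.0, f)

y, f, h = 1, 0.4, 1e-5
pi = sigmoid(f)

grad_analytic = y - pi          # first derivative: y - pi
hess_analytic = -pi * (1 - pi)  # second derivative: -pi(1 - pi), always negative

grad_numeric = (log_lik(y, f + h) - log_lik(y, f - h)) / (2 * h)
hess_numeric = (log_lik(y, f + h) - 2 * log_lik(y, f) + log_lik(y, f - h)) / h**2

print(grad_analytic, grad_numeric)
print(hess_analytic, hess_numeric)
```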
Key Observations:
The gradient $\nabla \psi_i = y_i - \pi_i$ is simply the residual between the observed label and the predicted probability. The second derivative $\nabla^2 \psi_i = -\pi_i(1 - \pi_i)$ is strictly negative, so each log-likelihood term is concave in $f_i$. Together with the log of the Gaussian prior (a concave quadratic), this ensures the log-posterior is concave, which means the Laplace approximation finds a unique global maximum.
The concavity of the log-posterior guarantees that gradient-based optimization will find the unique global mode. This is NOT guaranteed for all likelihoods—some robust classification likelihoods (e.g., t-distribution based) may have multiple modes, complicating inference.
We have established the fundamental challenge that defines all of GP classification: the incompatibility between Gaussian process priors and classification likelihoods.
What's Next:
With the problem clearly defined, we now turn to solutions. The next page introduces the Laplace approximation—the simplest and most widely-used method for GP classification. We'll derive the algorithm, understand its geometric interpretation, and learn when it works well and when more sophisticated methods are needed.
You now understand why GP classification requires approximate inference: the non-Gaussian Bernoulli likelihood breaks the conjugacy that made GP regression tractable. This challenge—and the Gaussian approximations that address it—will be the theme throughout this module.