Gaussian Process regression sits at a remarkable intersection of mathematical elegance and practical utility. Unlike most machine learning models where inference requires approximations, iterative optimization, or sampling, GP regression admits a closed-form posterior—we can write down the exact answer. This is extraordinarily rare and valuable.
The key insight enabling this closed-form solution is that the posterior of a Gaussian Process, conditioned on Gaussian-distributed observations, is itself a Gaussian Process. This conjugacy property means we can compute posterior mean and covariance functions analytically, giving us not just predictions but complete probability distributions over functions.
In this page, we derive this posterior from first principles. The mathematics involves conditioning multivariate Gaussian distributions—a fundamental operation that appears throughout probabilistic machine learning. Mastering this derivation provides deep insight into why GPs work and how to extend them.
By the end of this page, you will: (1) Understand how to condition a Gaussian Process on observed data, (2) Derive the posterior mean and covariance functions analytically, (3) Recognize the connection between GP posterior computation and linear algebra, (4) Appreciate why this closed-form solution enables practical GP applications.
Before diving into the posterior derivation, let us establish the setting precisely. We have:
Training Data: A set of $n$ input-output pairs $\mathcal{D} = \{(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \ldots, (\mathbf{x}_n, y_n)\}$ where each $\mathbf{x}_i \in \mathbb{R}^d$ is a $d$-dimensional feature vector and $y_i \in \mathbb{R}$ is the corresponding observation.
GP Prior: We assume a Gaussian Process prior over functions $f \sim \mathcal{GP}(m(\mathbf{x}), k(\mathbf{x}, \mathbf{x}'))$, where $m(\mathbf{x})$ is the mean function and $k(\mathbf{x}, \mathbf{x}')$ is the covariance (kernel) function.
Observation Model: For now, we assume noise-free observations, meaning $y_i = f(\mathbf{x}_i)$ exactly. We will relax this in later sections.
Goal: Given test inputs $\mathbf{X}_* = (\mathbf{x}_*^{(1)}, \ldots, \mathbf{x}_*^{(m)})$, compute the posterior distribution $p(\mathbf{f}_* \mid \mathbf{X}_*, \mathbf{X}, \mathbf{y})$ over function values at these test points.
Throughout this derivation: X (bold capital) denotes the n×d matrix of training inputs with rows x₁, ..., xₙ. y (bold lowercase) is the n×1 vector of training outputs. X* denotes test inputs, f* denotes function values at test inputs. K denotes covariance matrices with subscripts indicating which points: K(X,X), K(X*,X), etc.
The fundamental property of Gaussian Processes is the marginalization property: any finite collection of function values follows a multivariate Gaussian distribution. This means if we pick $n + m$ points (our $n$ training points plus $m$ test points), the function values at these points are jointly Gaussian.
Let $\mathbf{f} = [f(\mathbf{x}_1), \ldots, f(\mathbf{x}_n)]^\top$ denote function values at training points, and $\mathbf{f}_* = [f(\mathbf{x}_*^{(1)}), \ldots, f(\mathbf{x}_*^{(m)})]^\top$ denote function values at test points. The joint distribution is:
$$\begin{bmatrix} \mathbf{f} \\ \mathbf{f}_* \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} \mathbf{m}(\mathbf{X}) \\ \mathbf{m}(\mathbf{X}_*) \end{bmatrix}, \begin{bmatrix} K(\mathbf{X}, \mathbf{X}) & K(\mathbf{X}, \mathbf{X}_*) \\ K(\mathbf{X}_*, \mathbf{X}) & K(\mathbf{X}_*, \mathbf{X}_*) \end{bmatrix} \right)$$
Here, $K(\mathbf{X}, \mathbf{X})$ is the $n \times n$ covariance matrix of the training inputs, $K(\mathbf{X}, \mathbf{X}_*)$ is the $n \times m$ cross-covariance between training and test inputs, $K(\mathbf{X}_*, \mathbf{X}) = K(\mathbf{X}, \mathbf{X}_*)^\top$, and $K(\mathbf{X}_*, \mathbf{X}_*)$ is the $m \times m$ covariance matrix of the test inputs.
Think of this joint Gaussian as defining a probability distribution over all possible configurations of function values at both training and test points simultaneously. The off-diagonal block K(X, X*) captures how knowledge at training points informs our beliefs at test points—this is the correlation that enables prediction.
The Block Structure Matters
The block structure of the joint covariance matrix encodes crucial information:
Top-left block $K(\mathbf{X}, \mathbf{X})$: Prior uncertainty about function values at training locations. Diagonal entries represent variance at each point; off-diagonal entries capture covariance between points.
Bottom-right block $K(\mathbf{X}_*, \mathbf{X}_*)$: Prior uncertainty at test locations before seeing any data.
Off-diagonal blocks $K(\mathbf{X}, \mathbf{X}_*)$ and $K(\mathbf{X}_*, \mathbf{X})$: The key to GP regression! These blocks tell us how function values at test points correlate with function values at training points. Strong correlation means observing training data will substantially inform our predictions.
When the kernel decays with distance (as in the RBF kernel), test points far from all training points will have small entries in the cross-covariance block, meaning our predictions there remain uncertain.
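As a quick check on this point, here is a minimal sketch (it reuses the RBF kernel implemented later on this page; the training inputs are made up for illustration) showing how the cross-covariance entries shrink as a test point moves away from the training data:

```python
import numpy as np

def rbf_kernel(X1, X2, length_scale=1.0, variance=1.0):
    """RBF kernel matrix between two sets of points."""
    sq_dist = np.sum(X1**2, axis=1, keepdims=True) - 2 * X1 @ X2.T + np.sum(X2**2, axis=1)
    return variance * np.exp(-0.5 * sq_dist / length_scale**2)

X_train = np.array([[0.0], [1.0], [2.0]])    # three training inputs
for x_star in [0.5, 3.0, 10.0]:              # near, moderately far, very far test point
    cross_cov = rbf_kernel(np.array([[x_star]]), X_train)
    print(f"x* = {x_star:5.1f}  ->  K(x*, X) = {np.round(cross_cov.ravel(), 4)}")
# Far from all training points, every entry of the cross-covariance row is
# essentially zero, so the posterior there stays close to the prior.
```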
Now comes the heart of GP inference: conditioning the joint Gaussian on observed data. This is where the magic of Gaussians becomes apparent—conditioning a Gaussian random vector on a subset of its components yields another Gaussian, with closed-form mean and covariance.
The General Conditioning Formula
For a multivariate Gaussian partitioned as: $$\begin{bmatrix} \mathbf{a} \\ \mathbf{b} \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} \boldsymbol{\mu}_a \\ \boldsymbol{\mu}_b \end{bmatrix}, \begin{bmatrix} \Sigma_{aa} & \Sigma_{ab} \\ \Sigma_{ba} & \Sigma_{bb} \end{bmatrix} \right)$$
The conditional distribution of $\mathbf{b}$ given $\mathbf{a} = \mathbf{a}_0$ is: $$\mathbf{b} \mid \mathbf{a} = \mathbf{a}_0 \sim \mathcal{N}(\boldsymbol{\mu}_{b|a}, \Sigma_{b|a})$$
where:

$$\boldsymbol{\mu}_{b|a} = \boldsymbol{\mu}_b + \Sigma_{ba} \Sigma_{aa}^{-1} (\mathbf{a}_0 - \boldsymbol{\mu}_a)$$

$$\Sigma_{b|a} = \Sigma_{bb} - \Sigma_{ba} \Sigma_{aa}^{-1} \Sigma_{ab}$$
Notice that the conditional covariance Σ_{b|a} does NOT depend on the observed value a₀ — only on which variables were observed. The conditional mean DOES depend on what we observed. This means: (1) Prediction uncertainty depends only on input locations, not output values. (2) The predicted function values depend on the actual observations.
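A small numerical illustration of this asymmetry (a sketch that applies the conditioning formulas directly to a bivariate Gaussian; the numbers are chosen only for illustration):

```python
import numpy as np

# Jointly Gaussian (a, b) with zero means, unit variances, correlation 0.8
mu_a, mu_b = 0.0, 0.0
s_aa, s_ab, s_bb = 1.0, 0.8, 1.0

for a0 in [-3.0, 0.0, 2.0]:
    cond_mean = mu_b + (s_ab / s_aa) * (a0 - mu_a)  # depends on the observed value a0
    cond_var = s_bb - s_ab * (1.0 / s_aa) * s_ab    # independent of a0
    print(f"a0 = {a0:5.1f}  ->  mean = {cond_mean:5.2f}, var = {cond_var:.2f}")
# The conditional variance is 0.36 for every observed value; only the mean changes.
```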
Intuition Behind the Conditioning Formula
The conditional mean formula has a beautiful interpretation:
$$\boldsymbol{\mu}_{b|a} = \boldsymbol{\mu}_b + \underbrace{\Sigma_{ba} \Sigma_{aa}^{-1}}_{\text{regression coefficients}} \underbrace{(\mathbf{a}_0 - \boldsymbol{\mu}_a)}_{\text{deviation from prior}}$$
The conditional covariance formula:
$$\Sigma_{b|a} = \underbrace{\Sigma_{bb}}_{\text{prior uncertainty}} - \underbrace{\Sigma_{ba} \Sigma_{aa}^{-1} \Sigma_{ab}}_{\text{variance reduction}}$$
```python
import numpy as np

def gaussian_conditional(mu_a, mu_b, sigma_aa, sigma_ab, sigma_bb, a_observed):
    """
    Compute conditional distribution p(b | a = a_observed)
    for jointly Gaussian random vectors [a, b].

    Parameters:
        mu_a: Prior mean of a (n_a,)
        mu_b: Prior mean of b (n_b,)
        sigma_aa: Covariance of a (n_a, n_a)
        sigma_ab: Cross-covariance between a and b (n_a, n_b)
        sigma_bb: Covariance of b (n_b, n_b)
        a_observed: Observed value of a (n_a,)

    Returns:
        mu_b_given_a: Conditional mean (n_b,)
        sigma_b_given_a: Conditional covariance (n_b, n_b)
    """
    # For numerical stability, use Cholesky decomposition and solve
    # instead of explicitly computing the inverse
    L = np.linalg.cholesky(sigma_aa)  # sigma_aa = L @ L.T

    # Solve L @ alpha_1 = (a_observed - mu_a), then L.T @ alpha = alpha_1
    # This computes sigma_aa^{-1} @ (a_observed - mu_a)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, a_observed - mu_a))

    # Conditional mean: mu_b + sigma_ba @ sigma_aa^{-1} @ (a - mu_a)
    # Note: sigma_ba = sigma_ab.T
    mu_b_given_a = mu_b + sigma_ab.T @ alpha

    # Solve for sigma_aa^{-1} @ sigma_ab = V
    V = np.linalg.solve(L.T, np.linalg.solve(L, sigma_ab))

    # Conditional covariance: sigma_bb - sigma_ba @ sigma_aa^{-1} @ sigma_ab
    sigma_b_given_a = sigma_bb - sigma_ab.T @ V

    return mu_b_given_a, sigma_b_given_a


# Example: Two-dimensional Gaussian
# Joint distribution of [a, b]
mu = np.array([0.0, 0.0])  # Prior means
sigma = np.array([
    [1.0, 0.8],  # Var(a)=1, Cov(a,b)=0.8
    [0.8, 1.0]   # Cov(b,a)=0.8, Var(b)=1
])

# Observe a = 2.0
a_obs = np.array([2.0])

# Compute conditional b | a=2
mu_b_cond, sigma_b_cond = gaussian_conditional(
    mu_a=mu[:1], mu_b=mu[1:],
    sigma_aa=sigma[:1, :1], sigma_ab=sigma[:1, 1:],
    sigma_bb=sigma[1:, 1:],
    a_observed=a_obs
)

print(f"p(b | a=2) = N({mu_b_cond[0]:.3f}, {sigma_b_cond[0,0]:.3f})")
# Output: p(b | a=2) = N(1.600, 0.360)
# Interpretation: High correlation (0.8) means observing a=2 shifts
# our expectation of b to 1.6, and reduces variance from 1.0 to 0.36
```

We now apply the Gaussian conditioning formula to our GP setting. In the noise-free case where $\mathbf{y} = \mathbf{f}$ (observations equal true function values), we condition the joint distribution on the observed training outputs.
Identifying terms with the conditioning formula: $\mathbf{a} \leftrightarrow \mathbf{f}$ (observed as $\mathbf{y}$), $\mathbf{b} \leftrightarrow \mathbf{f}_*$, $\Sigma_{aa} \leftrightarrow K(\mathbf{X}, \mathbf{X})$, $\Sigma_{ab} \leftrightarrow K(\mathbf{X}, \mathbf{X}_*)$, $\Sigma_{ba} \leftrightarrow K(\mathbf{X}_*, \mathbf{X})$, and $\Sigma_{bb} \leftrightarrow K(\mathbf{X}_*, \mathbf{X}_*)$.
Assuming zero prior mean ($\mathbf{m}(\mathbf{x}) = 0$), the GP posterior is:
$$\mathbf{f}_* \mid \mathbf{X}_*, \mathbf{X}, \mathbf{y} \sim \mathcal{N}(\boldsymbol{\mu}_*, \Sigma_*)$$
where:
$$\boxed{\boldsymbol{\mu}_* = K(\mathbf{X}_*, \mathbf{X}) K(\mathbf{X}, \mathbf{X})^{-1} \mathbf{y}}$$
$$\boxed{\Sigma_* = K(\mathbf{X}_*, \mathbf{X}_*) - K(\mathbf{X}_*, \mathbf{X}) K(\mathbf{X}, \mathbf{X})^{-1} K(\mathbf{X}, \mathbf{X}_*)}$$
These two equations are the heart of Gaussian Process regression. The posterior mean μ* is a linear combination of the training outputs y, weighted by the kernel similarity to test points. The posterior covariance Σ* starts from prior uncertainty K(X*,X*) and subtracts the information gained from training data.
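As a sanity check, consider the special case of a single training pair $(\mathbf{x}_1, y_1)$ and a single test point $\mathbf{x}_*$; the boxed formulas reduce to scalars:

$$\mu_* = \frac{k(\mathbf{x}_*, \mathbf{x}_1)}{k(\mathbf{x}_1, \mathbf{x}_1)}\, y_1, \qquad \Sigma_* = k(\mathbf{x}_*, \mathbf{x}_*) - \frac{k(\mathbf{x}_*, \mathbf{x}_1)^2}{k(\mathbf{x}_1, \mathbf{x}_1)}$$

When $\mathbf{x}_* = \mathbf{x}_1$, the mean equals $y_1$ and the variance drops to zero, which matches the perfect-interpolation behavior summarized in the table below.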
Interpreting the Posterior Mean
The posterior mean can be rewritten as:
$$\boldsymbol{\mu}_* = K(\mathbf{X}_*, \mathbf{X}) \boldsymbol{\alpha}, \quad \text{where } \boldsymbol{\alpha} = K(\mathbf{X}, \mathbf{X})^{-1} \mathbf{y}$$
This reveals that the GP posterior mean is a linear combination of kernel functions centered at training points:
$$\mu_*(\mathbf{x}_*) = \sum_{i=1}^{n} \alpha_i \, k(\mathbf{x}_*, \mathbf{x}_i)$$
The coefficients $\boldsymbol{\alpha}$ are determined by solving a linear system. This representation connects GPs to kernel methods: the posterior mean lies in the reproducing kernel Hilbert space (RKHS) spanned by kernel functions centered at training points.
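To make this concrete, here is a minimal sketch (reusing the `rbf_kernel` helper defined later on this page; the training data are made up for illustration): compute $\boldsymbol{\alpha}$ once, then evaluate the posterior mean anywhere as a weighted sum of kernel functions.

```python
import numpy as np

def rbf_kernel(X1, X2, length_scale=1.0, variance=1.0):
    sq_dist = np.sum(X1**2, axis=1, keepdims=True) - 2 * X1 @ X2.T + np.sum(X2**2, axis=1)
    return variance * np.exp(-0.5 * sq_dist / length_scale**2)

X_train = np.array([[-1.0], [0.0], [1.5]])
y_train = np.sin(X_train).ravel()

# alpha = K(X, X)^{-1} y (with a small jitter for stability)
K = rbf_kernel(X_train, X_train) + 1e-8 * np.eye(len(X_train))
alpha = np.linalg.solve(K, y_train)

# Posterior mean at a new point: sum_i alpha_i * k(x*, x_i)
x_star = np.array([[0.7]])
k_star = rbf_kernel(x_star, X_train).ravel()               # (n,)
mu_star = np.sum(alpha * k_star)                           # kernel expansion
mu_star_matrix = (rbf_kernel(x_star, X_train) @ alpha)[0]  # matrix form, same number
print(mu_star, mu_star_matrix)  # both evaluate the same posterior mean
```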
Interpreting the Posterior Covariance
The posterior covariance has the form:
$$\Sigma_* = K(\mathbf{X}_*, \mathbf{X}_*) - K(\mathbf{X}_*, \mathbf{X}) K(\mathbf{X}, \mathbf{X})^{-1} K(\mathbf{X}, \mathbf{X}_*)$$
Key insights:
| Test Location | Posterior Mean | Posterior Variance | Interpretation |
|---|---|---|---|
| At training point xᵢ | Exactly yᵢ | Zero | Perfect interpolation |
| Near training points | Weighted average of nearby yᵢ | Low | High confidence predictions |
| Between training points | Smooth interpolation | Moderate | Kernel-guided interpolation |
| Far from all training data | Prior mean (0) | Prior variance | Reverts to prior beliefs |
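The table's extremes can be verified directly from the posterior formulas. The following sketch (with made-up training data and the RBF kernel defined elsewhere on this page) evaluates the posterior at a training point and at a point far from all data:

```python
import numpy as np

def rbf_kernel(X1, X2, length_scale=1.0, variance=1.0):
    sq_dist = np.sum(X1**2, axis=1, keepdims=True) - 2 * X1 @ X2.T + np.sum(X2**2, axis=1)
    return variance * np.exp(-0.5 * sq_dist / length_scale**2)

X_train = np.array([[-1.0], [0.0], [1.0]])
y_train = np.array([0.5, -0.2, 0.9])
K = rbf_kernel(X_train, X_train) + 1e-8 * np.eye(3)

for x in [[0.0], [20.0]]:                    # at a training point, and far away
    x_star = np.array([x])
    k_star = rbf_kernel(x_star, X_train)     # (1, n) cross-covariance
    mu = (k_star @ np.linalg.solve(K, y_train))[0]
    var = rbf_kernel(x_star, x_star)[0, 0] - (k_star @ np.linalg.solve(K, k_star.T))[0, 0]
    print(f"x* = {x[0]:5.1f}:  mean = {mu: .4f},  variance = {var:.4f}")
# At x* = 0.0 the mean is (numerically) the training output -0.2 and the variance is ~0;
# at x* = 20.0 the mean is ~0 (prior mean) and the variance is ~1 (prior variance).
```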
While the posterior equations are elegant, naive implementation invites numerical disaster. Computing $K(\mathbf{X}, \mathbf{X})^{-1}$ directly is numerically unstable (kernel matrices are often ill-conditioned, so explicit inversion amplifies rounding error) and computationally wasteful (forming the inverse costs more than solving the linear systems we actually need).
The Cholesky Decomposition Approach
The numerically stable approach uses the Cholesky decomposition: $$K(\mathbf{X}, \mathbf{X}) = \mathbf{L} \mathbf{L}^\top$$
where $\mathbf{L}$ is a lower triangular matrix. This decomposition exists for any symmetric positive definite matrix, costs $O(n^3)$ operations with a small constant, and reduces every subsequent solve to cheap $O(n^2)$ triangular substitutions.
To compute the posterior mean: $$\boldsymbol{\alpha} = K^{-1} \mathbf{y} = (\mathbf{L} \mathbf{L}^\top)^{-1} \mathbf{y} = \mathbf{L}^{-\top} \mathbf{L}^{-1} \mathbf{y}$$
We solve: first $\mathbf{L}\mathbf{z} = \mathbf{y}$ by forward substitution, then $\mathbf{L}^\top \boldsymbol{\alpha} = \mathbf{z}$ by back substitution.
Then: $\boldsymbol{\mu}_* = K(\mathbf{X}_*, \mathbf{X}) \boldsymbol{\alpha}$
```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def rbf_kernel(X1, X2, length_scale=1.0, variance=1.0):
    """Compute RBF (Squared Exponential) kernel matrix."""
    # Pairwise squared distances
    sq_dist = np.sum(X1**2, axis=1, keepdims=True) - 2 * X1 @ X2.T + np.sum(X2**2, axis=1)
    return variance * np.exp(-0.5 * sq_dist / length_scale**2)

def gp_posterior(X_train, y_train, X_test, kernel_fn, jitter=1e-8):
    """
    Compute GP posterior mean and variance at test points.

    Parameters:
        X_train: Training inputs (n, d)
        y_train: Training outputs (n,)
        X_test: Test inputs (m, d)
        kernel_fn: Kernel function k(X1, X2) -> (n1, n2) matrix
        jitter: Small value added to diagonal for numerical stability

    Returns:
        mu_post: Posterior mean at test points (m,)
        var_post: Posterior variance at test points (m,)
        cov_post: Full posterior covariance matrix (m, m)
    """
    n = X_train.shape[0]

    # Compute kernel matrices
    K = kernel_fn(X_train, X_train)          # (n, n)
    K_star = kernel_fn(X_test, X_train)      # (m, n)
    K_star_star = kernel_fn(X_test, X_test)  # (m, m)

    # Add jitter for numerical stability
    K = K + jitter * np.eye(n)

    # Cholesky decomposition: K = L @ L.T
    L = cholesky(K, lower=True)  # (n, n) lower triangular

    # Solve L @ z = y (forward substitution)
    z = solve_triangular(L, y_train, lower=True)  # (n,)

    # Solve L.T @ alpha = z (backward substitution)
    alpha = solve_triangular(L.T, z, lower=False)  # (n,)

    # Posterior mean: K_star @ alpha = K_star @ K^{-1} @ y
    mu_post = K_star @ alpha  # (m,)

    # For posterior covariance, solve L @ V.T = K_star.T
    V = solve_triangular(L, K_star.T, lower=True).T  # (m, n)

    # Full posterior covariance: K_** - K_* @ K^{-1} @ K_*^T
    cov_post = K_star_star - V @ V.T  # (m, m)

    # Posterior variance (diagonal of covariance)
    var_post = np.diag(cov_post)  # (m,)

    return mu_post, var_post, cov_post


# Example usage
np.random.seed(42)

# Training data: 5 noisy observations of sin function
X_train = np.array([[-4.0], [-3.0], [-1.0], [1.0], [2.0]])
y_train = np.sin(X_train.flatten()) + 0.1 * np.random.randn(5)

# Test points: dense grid
X_test = np.linspace(-5, 5, 100).reshape(-1, 1)

# Get posterior
mu, var, cov = gp_posterior(
    X_train, y_train, X_test,
    kernel_fn=lambda X1, X2: rbf_kernel(X1, X2, length_scale=1.0, variance=1.0)
)

print(f"Posterior mean range: [{mu.min():.3f}, {mu.max():.3f}]")
print(f"Posterior variance range: [{var.min():.3f}, {var.max():.3f}]")

# Verify interpolation at training points
mu_at_train, var_at_train, _ = gp_posterior(
    X_train, y_train, X_train,
    kernel_fn=lambda X1, X2: rbf_kernel(X1, X2, 1.0, 1.0)
)
print(f"\nMax absolute error at training points: {np.max(np.abs(mu_at_train - y_train)):.2e}")
print(f"Max variance at training points: {np.max(var_at_train):.2e}")
```

Adding a small 'jitter' to the diagonal (typically 10⁻⁶ to 10⁻⁸) ensures numerical stability. This effectively assumes tiny observation noise, preventing issues when training points are very close together or when the kernel matrix becomes nearly singular.
For completeness, let us derive the Gaussian conditioning formula from scratch using density manipulation. This provides deeper insight into why the formula takes its particular form.
Step 1: Write the Joint Density
For jointly Gaussian vectors $\mathbf{a}$ and $\mathbf{b}$, the joint density is:
$$p(\mathbf{a}, \mathbf{b}) = \frac{1}{(2\pi)^{(n_a+n_b)/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} \begin{bmatrix} \mathbf{a} - \boldsymbol{\mu}_a \\ \mathbf{b} - \boldsymbol{\mu}_b \end{bmatrix}^\top \Sigma^{-1} \begin{bmatrix} \mathbf{a} - \boldsymbol{\mu}_a \\ \mathbf{b} - \boldsymbol{\mu}_b \end{bmatrix} \right)$$
Step 2: Block Inverse Using Schur Complement
The inverse of the block covariance matrix can be written using the Schur complement. Let $S = \Sigma_{bb} - \Sigma_{ba} \Sigma_{aa}^{-1} \Sigma_{ab}$ (the Schur complement of $\Sigma_{aa}$). Then:
$$\Sigma^{-1} = \begin{bmatrix} \Sigma_{aa}^{-1} + \Sigma_{aa}^{-1} \Sigma_{ab} S^{-1} \Sigma_{ba} \Sigma_{aa}^{-1} & -\Sigma_{aa}^{-1} \Sigma_{ab} S^{-1} \\ -S^{-1} \Sigma_{ba} \Sigma_{aa}^{-1} & S^{-1} \end{bmatrix}$$
Step 3: Expand the Quadratic Form
Define $\delta_a = \mathbf{a} - \boldsymbol{\mu}_a$ and $\delta_b = \mathbf{b} - \boldsymbol{\mu}_b$. The exponent in the joint density is:
$$Q = \begin{bmatrix} \delta_a \\ \delta_b \end{bmatrix}^\top \Sigma^{-1} \begin{bmatrix} \delta_a \\ \delta_b \end{bmatrix}$$
Expanding using the block inverse:
$$Q = \delta_a^\top \Sigma_{aa}^{-1} \delta_a + \delta_a^\top \Sigma_{aa}^{-1} \Sigma_{ab} S^{-1} \Sigma_{ba} \Sigma_{aa}^{-1} \delta_a - 2 \delta_a^\top \Sigma_{aa}^{-1} \Sigma_{ab} S^{-1} \delta_b + \delta_b^\top S^{-1} \delta_b$$
Step 4: Complete the Square in $\delta_b$
Grouping terms involving $\delta_b$:
$$Q = [\text{terms in } \delta_a \text{ only}] + (\delta_b - \Sigma_{ba} \Sigma_{aa}^{-1} \delta_a)^\top S^{-1} (\delta_b - \Sigma_{ba} \Sigma_{aa}^{-1} \delta_a)$$
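Expanding the squared term and comparing with $Q$ above confirms the grouping:

$$(\delta_b - \Sigma_{ba}\Sigma_{aa}^{-1}\delta_a)^\top S^{-1} (\delta_b - \Sigma_{ba}\Sigma_{aa}^{-1}\delta_a) = \delta_b^\top S^{-1} \delta_b - 2\,\delta_a^\top \Sigma_{aa}^{-1}\Sigma_{ab} S^{-1} \delta_b + \delta_a^\top \Sigma_{aa}^{-1}\Sigma_{ab} S^{-1} \Sigma_{ba}\Sigma_{aa}^{-1}\delta_a$$

These are exactly the $\delta_b$-dependent terms of $Q$, so the only leftover term in $\delta_a$ alone is $\delta_a^\top \Sigma_{aa}^{-1} \delta_a$, which is absorbed by the marginal $p(\mathbf{a})$ in the next step.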
Step 5: Apply Bayes' Rule
$$p(\mathbf{b} \mid \mathbf{a}) = \frac{p(\mathbf{a}, \mathbf{b})}{p(\mathbf{a})} \propto \exp\left( -\frac{1}{2} (\mathbf{b} - \boldsymbol{\mu}_{b|a})^\top S^{-1} (\mathbf{b} - \boldsymbol{\mu}_{b|a}) \right)$$
where: $$\boldsymbol{\mu}_{b|a} = \boldsymbol{\mu}_b + \Sigma_{ba} \Sigma_{aa}^{-1} (\mathbf{a} - \boldsymbol{\mu}_a)$$
This is a Gaussian with mean $\boldsymbol{\mu}_{b|a}$ and covariance $S = \Sigma_{bb} - \Sigma_{ba} \Sigma_{aa}^{-1} \Sigma_{ab}$. QED.
The Schur complement S = Σ_{bb} - Σ_{ba}Σ_{aa}⁻¹Σ_{ab} appears naturally as the conditional covariance. This identity is fundamental in linear algebra and appears throughout probabilistic modeling, Kalman filtering, and structured matrix computations.
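A quick numerical sanity check of this identity (a small sketch using a randomly generated positive definite covariance; the partition sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

# Build a random symmetric positive definite joint covariance, partitioned 2 + 3
A = rng.standard_normal((5, 5))
Sigma = A @ A.T + 5 * np.eye(5)
Saa, Sab = Sigma[:2, :2], Sigma[:2, 2:]
Sba, Sbb = Sigma[2:, :2], Sigma[2:, 2:]

# Schur complement of Saa
S = Sbb - Sba @ np.linalg.solve(Saa, Sab)

# The bottom-right block of Sigma^{-1} should equal S^{-1}
Sigma_inv = np.linalg.inv(Sigma)
print(np.allclose(Sigma_inv[2:, 2:], np.linalg.inv(S)))  # True
```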
We have so far assumed zero prior mean for simplicity, but the extension to arbitrary mean functions is straightforward.
With prior mean function $m(\mathbf{x})$, the joint distribution becomes:
$$\begin{bmatrix} \mathbf{f} \\ \mathbf{f}_* \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} \mathbf{m} \\ \mathbf{m}_* \end{bmatrix}, \begin{bmatrix} K & K_* \\ K_*^\top & K_{**} \end{bmatrix} \right)$$
where $\mathbf{m} = [m(\mathbf{x}_1), \ldots, m(\mathbf{x}_n)]^\top$, $\mathbf{m}_* = [m(\mathbf{x}_*^{(1)}), \ldots, m(\mathbf{x}_*^{(m)})]^\top$, and we use the shorthand $K = K(\mathbf{X}, \mathbf{X})$, $K_* = K(\mathbf{X}, \mathbf{X}_*)$, $K_{**} = K(\mathbf{X}_*, \mathbf{X}_*)$.
Applying the conditioning formula:
$$\boldsymbol{\mu}_* = \mathbf{m}_* + K_*^\top K^{-1} (\mathbf{y} - \mathbf{m})$$
$$\Sigma_* = K_{**} - K_*^\top K^{-1} K_*$$
Non-zero mean functions are useful when you have prior knowledge about the function's general trend. Common choices include: (1) Constant mean m(x) = c for known baseline levels, (2) Linear mean m(x) = β₀ + β₁ᵀx for known linear trends, (3) Parametric means for domain-specific prior knowledge. The GP then learns deviations from this mean.
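As a sketch of how this extension looks in code (it adapts the `gp_posterior` implementation above; the `gp_posterior_with_mean` name and the `mean_fn` argument are hypothetical additions for illustration), the non-zero-mean posterior only requires subtracting the prior mean from the targets and adding it back at the test points:

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def gp_posterior_with_mean(X_train, y_train, X_test, kernel_fn, mean_fn, jitter=1e-8):
    """GP posterior with an arbitrary prior mean function m(x)."""
    m_train = mean_fn(X_train)   # m
    m_test = mean_fn(X_test)     # m_*

    K = kernel_fn(X_train, X_train) + jitter * np.eye(len(X_train))
    K_star = kernel_fn(X_test, X_train)      # K_*^T in the notation above
    K_star_star = kernel_fn(X_test, X_test)  # K_**

    L = cholesky(K, lower=True)
    # alpha = K^{-1} (y - m)
    alpha = solve_triangular(L.T, solve_triangular(L, y_train - m_train, lower=True), lower=False)

    mu_post = m_test + K_star @ alpha        # m_* + K_*^T K^{-1} (y - m)
    V = solve_triangular(L, K_star.T, lower=True)
    cov_post = K_star_star - V.T @ V         # K_** - K_*^T K^{-1} K_*
    return mu_post, cov_post

# Example: constant prior mean of 5.0 (a hypothetical baseline), reusing earlier data
# mu, cov = gp_posterior_with_mean(X_train, y_train, X_test,
#                                  kernel_fn=rbf_kernel,
#                                  mean_fn=lambda X: np.full(len(X), 5.0))
```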
Practical Implications
The choice of mean function affects predictions, especially in regions far from training data:
| Mean Function | Behavior Far From Data | Use Case |
|---|---|---|
| Zero mean $m(\mathbf{x}) = 0$ | Predictions revert to 0 | When 0 is a reasonable default |
| Constant mean $m(\mathbf{x}) = c$ | Predictions revert to $c$ | Known baseline/average |
| Linear mean $m(\mathbf{x}) = \boldsymbol{\beta}^\top \mathbf{x}$ | Predictions follow linear trend | Known overall trend |
| Prior from simpler model | Combines simple + complex model | Transfer learning |
For many applications, zero mean is acceptable because the kernel already encodes substantial prior knowledge. However, for extrapolation-heavy problems, the mean function becomes critical since the GP reverts to it beyond the training data.
We have derived the complete posterior distribution for Gaussian Process regression. Let us consolidate the key results: the joint prior over training and test function values is Gaussian; conditioning on the observations yields the posterior mean $\boldsymbol{\mu}_* = K(\mathbf{X}_*, \mathbf{X}) K(\mathbf{X}, \mathbf{X})^{-1} \mathbf{y}$ and covariance $\Sigma_* = K(\mathbf{X}_*, \mathbf{X}_*) - K(\mathbf{X}_*, \mathbf{X}) K(\mathbf{X}, \mathbf{X})^{-1} K(\mathbf{X}, \mathbf{X}_*)$; a non-zero prior mean simply shifts the targets and predictions accordingly; and the Cholesky decomposition makes these computations numerically stable.
What's Next
We have derived the posterior distribution, but we haven't yet discussed how to use it for prediction. In the next page, we explore the predictive distribution—how to make probabilistic predictions at new test points and quantify prediction uncertainty. We will also see how the posterior covariance structure enables principled uncertainty quantification that distinguishes GPs from most other regression methods.
You now understand how to derive the Gaussian Process posterior from first principles. This closed-form posterior—unique among flexible non-parametric methods—is what makes GP regression both mathematically elegant and computationally tractable for small to medium datasets.