Gaussian Process regression sits at a remarkable intersection of mathematical elegance and practical utility. Unlike most machine learning models where inference requires approximations, iterative optimization, or sampling, GP regression admits a closed-form posterior—we can write down the exact answer. This is extraordinarily rare and valuable.
The key insight enabling this closed-form solution is that the posterior of a Gaussian Process, conditioned on Gaussian-distributed observations, is itself a Gaussian Process. This conjugacy property means we can compute posterior mean and covariance functions analytically, giving us not just predictions but complete probability distributions over functions.
In this page, we derive this posterior from first principles. The mathematics involves conditioning multivariate Gaussian distributions—a fundamental operation that appears throughout probabilistic machine learning. Mastering this derivation provides deep insight into why GPs work and how to extend them.
By the end of this page, you will: (1) Understand how to condition a Gaussian Process on observed data, (2) Derive the posterior mean and covariance functions analytically, (3) Recognize the connection between GP posterior computation and linear algebra, (4) Appreciate why this closed-form solution enables practical GP applications.
Before diving into the posterior derivation, let us establish the setting precisely. We have:
Training Data: A set of $n$ input-output pairs $\mathcal{D} = \{(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \ldots, (\mathbf{x}_n, y_n)\}$ where each $\mathbf{x}_i \in \mathbb{R}^d$ is a $d$-dimensional feature vector and $y_i \in \mathbb{R}$ is the corresponding observation.
GP Prior: We assume a Gaussian Process prior over functions $f \sim \mathcal{GP}(m(\mathbf{x}), k(\mathbf{x}, \mathbf{x}'))$, where $m(\mathbf{x})$ is the mean function and $k(\mathbf{x}, \mathbf{x}')$ is the covariance (kernel) function.
Observation Model: For now, we assume noise-free observations, meaning $y_i = f(\mathbf{x}_i)$ exactly. We will relax this in later sections.
Goal: Given test inputs $\mathbf{X}_* = (\mathbf{x}_*^{(1)}, \ldots, \mathbf{x}_*^{(m)})$, compute the posterior distribution $p(\mathbf{f}_* \mid \mathbf{X}_*, \mathbf{X}, \mathbf{y})$ over function values at these test points.
Throughout this derivation: X (bold capital) denotes the n×d matrix of training inputs with rows x₁, ..., xₙ. y (bold lowercase) is the n×1 vector of training outputs. X* denotes test inputs, f* denotes function values at test inputs. K denotes covariance matrices with subscripts indicating which points: K(X,X), K(X*,X), etc.
The fundamental property of Gaussian Processes is the marginalization property: any finite collection of function values follows a multivariate Gaussian distribution. This means if we pick $n + m$ points (our $n$ training points plus $m$ test points), the function values at these points are jointly Gaussian.
Let $\mathbf{f} = [f(\mathbf{x}_1), \ldots, f(\mathbf{x}_n)]^\top$ denote function values at training points, and $\mathbf{f}_* = [f(\mathbf{x}_*^{(1)}), \ldots, f(\mathbf{x}_*^{(m)})]^\top$ denote function values at test points. The joint distribution is:
$$\begin{bmatrix} \mathbf{f} \\ \mathbf{f}_* \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} \mathbf{m}(\mathbf{X}) \\ \mathbf{m}(\mathbf{X}_*) \end{bmatrix}, \begin{bmatrix} K(\mathbf{X}, \mathbf{X}) & K(\mathbf{X}, \mathbf{X}_*) \\ K(\mathbf{X}_*, \mathbf{X}) & K(\mathbf{X}_*, \mathbf{X}_*) \end{bmatrix} \right)$$
Here, $K(\mathbf{X}, \mathbf{X})$ is the $n \times n$ covariance matrix of the training inputs, $K(\mathbf{X}, \mathbf{X}_*)$ is the $n \times m$ cross-covariance between training and test inputs, $K(\mathbf{X}_*, \mathbf{X}) = K(\mathbf{X}, \mathbf{X}_*)^\top$, and $K(\mathbf{X}_*, \mathbf{X}_*)$ is the $m \times m$ covariance matrix of the test inputs.
Think of this joint Gaussian as defining a probability distribution over all possible configurations of function values at both training and test points simultaneously. The off-diagonal block K(X, X*) captures how knowledge at training points informs our beliefs at test points—this is the correlation that enables prediction.
The Block Structure Matters
The block structure of the joint covariance matrix encodes crucial information:
Top-left block $K(\mathbf{X}, \mathbf{X})$: Prior uncertainty about function values at training locations. Diagonal entries represent variance at each point; off-diagonal entries capture covariance between points.
Bottom-right block $K(\mathbf{X}_*, \mathbf{X}_*)$: Prior uncertainty at test locations before seeing any data.
Off-diagonal blocks $K(\mathbf{X}, \mathbf{X}_*)$ and $K(\mathbf{X}_*, \mathbf{X})$: The key to GP regression! These blocks tell us how function values at test points correlate with function values at training points. Strong correlation means observing training data will substantially inform our predictions.
When the kernel decays with distance (as in the RBF kernel), test points far from all training points will have small entries in the cross-covariance block, meaning our predictions there remain uncertain.
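As a quick check on this point, here is a minimal sketch (it reuses the RBF kernel implemented later on this page; the training inputs are made up for illustration) showing how the cross-covariance entries shrink as a test point moves away from the training data:

```python
import numpy as np

def rbf_kernel(X1, X2, length_scale=1.0, variance=1.0):
    """RBF kernel matrix between two sets of points."""
    sq_dist = np.sum(X1**2, axis=1, keepdims=True) - 2 * X1 @ X2.T + np.sum(X2**2, axis=1)
    return variance * np.exp(-0.5 * sq_dist / length_scale**2)

X_train = np.array([[0.0], [1.0], [2.0]])    # three training inputs
for x_star in [0.5, 3.0, 10.0]:              # near, moderately far, very far test point
    cross_cov = rbf_kernel(np.array([[x_star]]), X_train)
    print(f"x* = {x_star:5.1f}  ->  K(x*, X) = {np.round(cross_cov.ravel(), 4)}")
# Far from all training points, every entry of the cross-covariance row is
# essentially zero, so the posterior there stays close to the prior.
```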
Now comes the heart of GP inference: conditioning the joint Gaussian on observed data. This is where the magic of Gaussians becomes apparent—conditioning a Gaussian random vector on a subset of its components yields another Gaussian, with closed-form mean and covariance.
The General Conditioning Formula
For a multivariate Gaussian partitioned as: $$\begin{bmatrix} \mathbf{a} \\ \mathbf{b} \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} \boldsymbol{\mu}_a \\ \boldsymbol{\mu}_b \end{bmatrix}, \begin{bmatrix} \Sigma_{aa} & \Sigma_{ab} \\ \Sigma_{ba} & \Sigma_{bb} \end{bmatrix} \right)$$
The conditional distribution of $\mathbf{b}$ given $\mathbf{a} = \mathbf{a}_0$ is: $$\mathbf{b} \mid \mathbf{a} = \mathbf{a}_0 \sim \mathcal{N}(\boldsymbol{\mu}_{b|a}, \Sigma_{b|a})$$
where:

$$\boldsymbol{\mu}_{b|a} = \boldsymbol{\mu}_b + \Sigma_{ba} \Sigma_{aa}^{-1} (\mathbf{a}_0 - \boldsymbol{\mu}_a)$$

$$\Sigma_{b|a} = \Sigma_{bb} - \Sigma_{ba} \Sigma_{aa}^{-1} \Sigma_{ab}$$
Notice that the conditional covariance Σ_{b|a} does NOT depend on the observed value a₀ — only on which variables were observed. The conditional mean DOES depend on what we observed. This means: (1) Prediction uncertainty depends only on input locations, not output values. (2) The predicted function values depend on the actual observations.
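A small numerical illustration of this asymmetry (a sketch that applies the conditioning formulas directly to a bivariate Gaussian; the numbers are chosen only for illustration):

```python
import numpy as np

# Jointly Gaussian (a, b) with zero means, unit variances, correlation 0.8
mu_a, mu_b = 0.0, 0.0
s_aa, s_ab, s_bb = 1.0, 0.8, 1.0

for a0 in [-3.0, 0.0, 2.0]:
    cond_mean = mu_b + (s_ab / s_aa) * (a0 - mu_a)  # depends on the observed value a0
    cond_var = s_bb - s_ab * (1.0 / s_aa) * s_ab    # independent of a0
    print(f"a0 = {a0:5.1f}  ->  mean = {cond_mean:5.2f}, var = {cond_var:.2f}")
# The conditional variance is 0.36 for every observed value; only the mean changes.
```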
Intuition Behind the Conditioning Formula
The conditional mean formula has a beautiful interpretation:
$$\boldsymbol{\mu}_{b|a} = \boldsymbol{\mu}_b + \underbrace{\Sigma_{ba} \Sigma_{aa}^{-1}}_{\text{regression coefficients}} \underbrace{(\mathbf{a}_0 - \boldsymbol{\mu}_a)}_{\text{deviation from prior}}$$
The conditional covariance formula:
$$\Sigma_{b|a} = \underbrace{\Sigma_{bb}}_{\text{prior uncertainty}} - \underbrace{\Sigma_{ba} \Sigma_{aa}^{-1} \Sigma_{ab}}_{\text{variance reduction}}$$
```python
import numpy as np

def gaussian_conditional(mu_a, mu_b, sigma_aa, sigma_ab, sigma_bb, a_observed):
    """
    Compute conditional distribution p(b | a = a_observed)
    for jointly Gaussian random vectors [a, b].

    Parameters:
        mu_a: Prior mean of a (n_a,)
        mu_b: Prior mean of b (n_b,)
        sigma_aa: Covariance of a (n_a, n_a)
        sigma_ab: Cross-covariance between a and b (n_a, n_b)
        sigma_bb: Covariance of b (n_b, n_b)
        a_observed: Observed value of a (n_a,)

    Returns:
        mu_b_given_a: Conditional mean (n_b,)
        sigma_b_given_a: Conditional covariance (n_b, n_b)
    """
    # For numerical stability, use Cholesky decomposition and solve
    # instead of explicitly computing the inverse
    L = np.linalg.cholesky(sigma_aa)  # sigma_aa = L @ L.T

    # Solve L @ alpha_1 = (a_observed - mu_a), then L.T @ alpha = alpha_1
    # This computes sigma_aa^{-1} @ (a_observed - mu_a)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, a_observed - mu_a))

    # Conditional mean: mu_b + sigma_ba @ sigma_aa^{-1} @ (a - mu_a)
    # Note: sigma_ba = sigma_ab.T
    mu_b_given_a = mu_b + sigma_ab.T @ alpha

    # Solve for sigma_aa^{-1} @ sigma_ab = V
    V = np.linalg.solve(L.T, np.linalg.solve(L, sigma_ab))

    # Conditional covariance: sigma_bb - sigma_ba @ sigma_aa^{-1} @ sigma_ab
    sigma_b_given_a = sigma_bb - sigma_ab.T @ V

    return mu_b_given_a, sigma_b_given_a


# Example: Two-dimensional Gaussian
# Joint distribution of [a, b]
mu = np.array([0.0, 0.0])  # Prior means
sigma = np.array([
    [1.0, 0.8],  # Var(a)=1, Cov(a,b)=0.8
    [0.8, 1.0]   # Cov(b,a)=0.8, Var(b)=1
])

# Observe a = 2.0
a_obs = np.array([2.0])

# Compute conditional b | a=2
mu_b_cond, sigma_b_cond = gaussian_conditional(
    mu_a=mu[:1], mu_b=mu[1:],
    sigma_aa=sigma[:1, :1], sigma_ab=sigma[:1, 1:],
    sigma_bb=sigma[1:, 1:],
    a_observed=a_obs
)

print(f"p(b | a=2) = N({mu_b_cond[0]:.3f}, {sigma_b_cond[0,0]:.3f})")
# Output: p(b | a=2) = N(1.600, 0.360)
# Interpretation: High correlation (0.8) means observing a=2 shifts
# our expectation of b to 1.6, and reduces variance from 1.0 to 0.36
```

We now apply the Gaussian conditioning formula to our GP setting. In the noise-free case where $\mathbf{y} = \mathbf{f}$ (observations equal true function values), we condition the joint distribution on the observed training outputs.
Identifying terms with the conditioning formula: $\mathbf{a} \leftrightarrow \mathbf{f}$ (observed as $\mathbf{y}$), $\mathbf{b} \leftrightarrow \mathbf{f}_*$, $\Sigma_{aa} \leftrightarrow K(\mathbf{X}, \mathbf{X})$, $\Sigma_{ab} \leftrightarrow K(\mathbf{X}, \mathbf{X}_*)$, $\Sigma_{ba} \leftrightarrow K(\mathbf{X}_*, \mathbf{X})$, and $\Sigma_{bb} \leftrightarrow K(\mathbf{X}_*, \mathbf{X}_*)$.
Assuming zero prior mean ($\mathbf{m}(\mathbf{x}) = 0$), the GP posterior is:
$$\mathbf{f}_* \mid \mathbf{X}_*, \mathbf{X}, \mathbf{y} \sim \mathcal{N}(\boldsymbol{\mu}_*, \Sigma_*)$$
where:
$$\boxed{\boldsymbol{\mu}_* = K(\mathbf{X}_*, \mathbf{X}) K(\mathbf{X}, \mathbf{X})^{-1} \mathbf{y}}$$
$$\boxed{\Sigma_* = K(\mathbf{X}_*, \mathbf{X}_*) - K(\mathbf{X}_*, \mathbf{X}) K(\mathbf{X}, \mathbf{X})^{-1} K(\mathbf{X}, \mathbf{X}_*)}$$
These two equations are the heart of Gaussian Process regression. The posterior mean μ* is a linear combination of the training outputs y, weighted by the kernel similarity to test points. The posterior covariance Σ* starts from prior uncertainty K(X*,X*) and subtracts the information gained from training data.
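As a sanity check, consider the special case of a single training pair $(\mathbf{x}_1, y_1)$ and a single test point $\mathbf{x}_*$; the boxed formulas reduce to scalars:

$$\mu_* = \frac{k(\mathbf{x}_*, \mathbf{x}_1)}{k(\mathbf{x}_1, \mathbf{x}_1)}\, y_1, \qquad \Sigma_* = k(\mathbf{x}_*, \mathbf{x}_*) - \frac{k(\mathbf{x}_*, \mathbf{x}_1)^2}{k(\mathbf{x}_1, \mathbf{x}_1)}$$

When $\mathbf{x}_* = \mathbf{x}_1$, the mean equals $y_1$ and the variance drops to zero, which matches the perfect-interpolation behavior summarized in the table below.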
Interpreting the Posterior Mean
The posterior mean can be rewritten as:
$$\boldsymbol{\mu}_* = K(\mathbf{X}_*, \mathbf{X}) \boldsymbol{\alpha}, \quad \text{where } \boldsymbol{\alpha} = K(\mathbf{X}, \mathbf{X})^{-1} \mathbf{y}$$
This reveals that the GP posterior mean is a linear combination of kernel functions centered at training points:
$$\mu_*(\mathbf{x}_*) = \sum_{i=1}^{n} \alpha_i \, k(\mathbf{x}_*, \mathbf{x}_i)$$
The coefficients $\boldsymbol{\alpha}$ are determined by solving a linear system. This representation connects GPs to kernel methods: the posterior mean lies in the reproducing kernel Hilbert space (RKHS) spanned by kernel functions centered at training points.
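To make this concrete, here is a minimal sketch (reusing the `rbf_kernel` helper defined later on this page; the training data are made up for illustration): compute $\boldsymbol{\alpha}$ once, then evaluate the posterior mean anywhere as a weighted sum of kernel functions.

```python
import numpy as np

def rbf_kernel(X1, X2, length_scale=1.0, variance=1.0):
    sq_dist = np.sum(X1**2, axis=1, keepdims=True) - 2 * X1 @ X2.T + np.sum(X2**2, axis=1)
    return variance * np.exp(-0.5 * sq_dist / length_scale**2)

X_train = np.array([[-1.0], [0.0], [1.5]])
y_train = np.sin(X_train).ravel()

# alpha = K(X, X)^{-1} y (with a small jitter for stability)
K = rbf_kernel(X_train, X_train) + 1e-8 * np.eye(len(X_train))
alpha = np.linalg.solve(K, y_train)

# Posterior mean at a new point: sum_i alpha_i * k(x*, x_i)
x_star = np.array([[0.7]])
k_star = rbf_kernel(x_star, X_train).ravel()               # (n,)
mu_star = np.sum(alpha * k_star)                           # kernel expansion
mu_star_matrix = (rbf_kernel(x_star, X_train) @ alpha)[0]  # matrix form, same number
print(mu_star, mu_star_matrix)  # both evaluate the same posterior mean
```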
Interpreting the Posterior Covariance
The posterior covariance has the form:
$$\Sigma_* = K(\mathbf{X}_*, \mathbf{X}_*) - K(\mathbf{X}_*, \mathbf{X}) K(\mathbf{X}, \mathbf{X})^{-1} K(\mathbf{X}, \mathbf{X}_*)$$
Key insights:
| Test Location | Posterior Mean | Posterior Variance | Interpretation |
|---|---|---|---|
| At training point xᵢ | Exactly yᵢ | Zero | Perfect interpolation |
| Near training points | Weighted average of nearby yᵢ | Low | High confidence predictions |
| Between training points | Smooth interpolation | Moderate | Kernel-guided interpolation |
| Far from all training data | Prior mean (0) | Prior variance | Reverts to prior beliefs |
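The table's extremes can be verified directly from the posterior formulas. The following sketch (with made-up training data and the RBF kernel defined elsewhere on this page) evaluates the posterior at a training point and at a point far from all data:

```python
import numpy as np

def rbf_kernel(X1, X2, length_scale=1.0, variance=1.0):
    sq_dist = np.sum(X1**2, axis=1, keepdims=True) - 2 * X1 @ X2.T + np.sum(X2**2, axis=1)
    return variance * np.exp(-0.5 * sq_dist / length_scale**2)

X_train = np.array([[-1.0], [0.0], [1.0]])
y_train = np.array([0.5, -0.2, 0.9])
K = rbf_kernel(X_train, X_train) + 1e-8 * np.eye(3)

for x in [[0.0], [20.0]]:                    # at a training point, and far away
    x_star = np.array([x])
    k_star = rbf_kernel(x_star, X_train)     # (1, n) cross-covariance
    mu = (k_star @ np.linalg.solve(K, y_train))[0]
    var = rbf_kernel(x_star, x_star)[0, 0] - (k_star @ np.linalg.solve(K, k_star.T))[0, 0]
    print(f"x* = {x[0]:5.1f}:  mean = {mu: .4f},  variance = {var:.4f}")
# At x* = 0.0 the mean is (numerically) the training output -0.2 and the variance is ~0;
# at x* = 20.0 the mean is ~0 (prior mean) and the variance is ~1 (prior variance).
```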
While the posterior equations are elegant, naive implementation invites numerical disaster. Computing $K(\mathbf{X}, \mathbf{X})^{-1}$ directly is numerically unstable (kernel matrices are often ill-conditioned, so explicit inversion amplifies rounding error) and computationally wasteful (forming the inverse costs more than solving the linear systems we actually need).
The Cholesky Decomposition Approach
The numerically stable approach uses the Cholesky decomposition: $$K(\mathbf{X}, \mathbf{X}) = \mathbf{L} \mathbf{L}^\top$$
where $\mathbf{L}$ is a lower triangular matrix. This decomposition exists for any symmetric positive definite matrix, costs $O(n^3)$ operations with a small constant, and reduces every subsequent solve to cheap $O(n^2)$ triangular substitutions.
To compute the posterior mean: $$\boldsymbol{\alpha} = K^{-1} \mathbf{y} = (\mathbf{L} \mathbf{L}^\top)^{-1} \mathbf{y} = \mathbf{L}^{-\top} \mathbf{L}^{-1} \mathbf{y}$$
We solve: first $\mathbf{L}\mathbf{z} = \mathbf{y}$ by forward substitution, then $\mathbf{L}^\top \boldsymbol{\alpha} = \mathbf{z}$ by back substitution.
Then: $\boldsymbol{\mu}_* = K(\mathbf{X}_*, \mathbf{X}) \boldsymbol{\alpha}$
```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def rbf_kernel(X1, X2, length_scale=1.0, variance=1.0):
    """Compute RBF (Squared Exponential) kernel matrix."""
    # Pairwise squared distances
    sq_dist = np.sum(X1**2, axis=1, keepdims=True) - 2 * X1 @ X2.T + np.sum(X2**2, axis=1)
    return variance * np.exp(-0.5 * sq_dist / length_scale**2)

def gp_posterior(X_train, y_train, X_test, kernel_fn, jitter=1e-8):
    """
    Compute GP posterior mean and variance at test points.

    Parameters:
        X_train: Training inputs (n, d)
        y_train: Training outputs (n,)
        X_test: Test inputs (m, d)
        kernel_fn: Kernel function k(X1, X2) -> (n1, n2) matrix
        jitter: Small value added to diagonal for numerical stability

    Returns:
        mu_post: Posterior mean at test points (m,)
        var_post: Posterior variance at test points (m,)
        cov_post: Full posterior covariance matrix (m, m)
    """
    n = X_train.shape[0]

    # Compute kernel matrices
    K = kernel_fn(X_train, X_train)          # (n, n)
    K_star = kernel_fn(X_test, X_train)      # (m, n)
    K_star_star = kernel_fn(X_test, X_test)  # (m, m)

    # Add jitter for numerical stability
    K = K + jitter * np.eye(n)

    # Cholesky decomposition: K = L @ L.T
    L = cholesky(K, lower=True)  # (n, n) lower triangular

    # Solve L @ z = y (forward substitution)
    z = solve_triangular(L, y_train, lower=True)  # (n,)

    # Solve L.T @ alpha = z (backward substitution)
    alpha = solve_triangular(L.T, z, lower=False)  # (n,)

    # Posterior mean: K_star @ alpha = K_star @ K^{-1} @ y
    mu_post = K_star @ alpha  # (m,)

    # For posterior covariance, solve L @ V.T = K_star.T
    V = solve_triangular(L, K_star.T, lower=True).T  # (m, n)

    # Full posterior covariance: K_** - K_* @ K^{-1} @ K_*^T
    cov_post = K_star_star - V @ V.T  # (m, m)

    # Posterior variance (diagonal of covariance)
    var_post = np.diag(cov_post)  # (m,)

    return mu_post, var_post, cov_post


# Example usage
np.random.seed(42)

# Training data: 5 noisy observations of sin function
X_train = np.array([[-4.0], [-3.0], [-1.0], [1.0], [2.0]])
y_train = np.sin(X_train.flatten()) + 0.1 * np.random.randn(5)

# Test points: dense grid
X_test = np.linspace(-5, 5, 100).reshape(-1, 1)

# Get posterior
mu, var, cov = gp_posterior(
    X_train, y_train, X_test,
    kernel_fn=lambda X1, X2: rbf_kernel(X1, X2, length_scale=1.0, variance=1.0)
)

print(f"Posterior mean range: [{mu.min():.3f}, {mu.max():.3f}]")
print(f"Posterior variance range: [{var.min():.3f}, {var.max():.3f}]")

# Verify interpolation at training points
mu_at_train, var_at_train, _ = gp_posterior(
    X_train, y_train, X_train,
    kernel_fn=lambda X1, X2: rbf_kernel(X1, X2, 1.0, 1.0)
)
print(f"\nMax absolute error at training points: {np.max(np.abs(mu_at_train - y_train)):.2e}")
print(f"Max variance at training points: {np.max(var_at_train):.2e}")
```

Adding a small 'jitter' to the diagonal (typically 10⁻⁶ to 10⁻⁸) ensures numerical stability. This effectively assumes tiny observation noise, preventing issues when training points are very close together or when the kernel matrix becomes nearly singular.
For completeness, let us derive the Gaussian conditioning formula from scratch using density manipulation. This provides deeper insight into why the formula takes its particular form.
Step 1: Write the Joint Density
For jointly Gaussian vectors $\mathbf{a}$ and $\mathbf{b}$, the joint density is:
$$p(\mathbf{a}, \mathbf{b}) = \frac{1}{(2\pi)^{(n_a+n_b)/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} \begin{bmatrix} \mathbf{a} - \boldsymbol{\mu}_a \\ \mathbf{b} - \boldsymbol{\mu}_b \end{bmatrix}^\top \Sigma^{-1} \begin{bmatrix} \mathbf{a} - \boldsymbol{\mu}_a \\ \mathbf{b} - \boldsymbol{\mu}_b \end{bmatrix} \right)$$
Step 2: Block Inverse Using Schur Complement
The inverse of the block covariance matrix can be written using the Schur complement. Let $S = \Sigma_{bb} - \Sigma_{ba} \Sigma_{aa}^{-1} \Sigma_{ab}$ (the Schur complement of $\Sigma_{aa}$). Then:
$$\Sigma^{-1} = \begin{bmatrix} \Sigma_{aa}^{-1} + \Sigma_{aa}^{-1} \Sigma_{ab} S^{-1} \Sigma_{ba} \Sigma_{aa}^{-1} & -\Sigma_{aa}^{-1} \Sigma_{ab} S^{-1} \\ -S^{-1} \Sigma_{ba} \Sigma_{aa}^{-1} & S^{-1} \end{bmatrix}$$
Step 3: Expand the Quadratic Form
Define $\delta_a = \mathbf{a} - \boldsymbol{\mu}_a$ and $\delta_b = \mathbf{b} - \boldsymbol{\mu}_b$. The exponent in the joint density is:
$$Q = \begin{bmatrix} \delta_a \\ \delta_b \end{bmatrix}^\top \Sigma^{-1} \begin{bmatrix} \delta_a \\ \delta_b \end{bmatrix}$$
Expanding using the block inverse:
$$Q = \delta_a^\top \Sigma_{aa}^{-1} \delta_a + \delta_a^\top \Sigma_{aa}^{-1} \Sigma_{ab} S^{-1} \Sigma_{ba} \Sigma_{aa}^{-1} \delta_a - 2 \delta_a^\top \Sigma_{aa}^{-1} \Sigma_{ab} S^{-1} \delta_b + \delta_b^\top S^{-1} \delta_b$$
Step 4: Complete the Square in $\delta_b$
Grouping terms involving $\delta_b$:
$$Q = [\text{terms in } \delta_a \text{ only}] + (\delta_b - \Sigma_{ba} \Sigma_{aa}^{-1} \delta_a)^\top S^{-1} (\delta_b - \Sigma_{ba} \Sigma_{aa}^{-1} \delta_a)$$
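Expanding the squared term and comparing with $Q$ above confirms the grouping:

$$(\delta_b - \Sigma_{ba}\Sigma_{aa}^{-1}\delta_a)^\top S^{-1} (\delta_b - \Sigma_{ba}\Sigma_{aa}^{-1}\delta_a) = \delta_b^\top S^{-1} \delta_b - 2\,\delta_a^\top \Sigma_{aa}^{-1}\Sigma_{ab} S^{-1} \delta_b + \delta_a^\top \Sigma_{aa}^{-1}\Sigma_{ab} S^{-1} \Sigma_{ba}\Sigma_{aa}^{-1}\delta_a$$

These are exactly the $\delta_b$-dependent terms of $Q$, so the only leftover term in $\delta_a$ alone is $\delta_a^\top \Sigma_{aa}^{-1} \delta_a$, which is absorbed by the marginal $p(\mathbf{a})$ in the next step.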
Step 5: Apply Bayes' Rule
$$p(\mathbf{b} \mid \mathbf{a}) = \frac{p(\mathbf{a}, \mathbf{b})}{p(\mathbf{a})} \propto \exp\left( -\frac{1}{2} (\mathbf{b} - \boldsymbol{\mu}_{b|a})^\top S^{-1} (\mathbf{b} - \boldsymbol{\mu}_{b|a}) \right)$$
where: $$\boldsymbol{\mu}_{b|a} = \boldsymbol{\mu}_b + \Sigma_{ba} \Sigma_{aa}^{-1} (\mathbf{a} - \boldsymbol{\mu}_a)$$
This is a Gaussian with mean $\boldsymbol{\mu}_{b|a}$ and covariance $S = \Sigma_{bb} - \Sigma_{ba} \Sigma_{aa}^{-1} \Sigma_{ab}$. QED.
The Schur complement S = Σ_{bb} - Σ_{ba}Σ_{aa}⁻¹Σ_{ab} appears naturally as the conditional covariance. This identity is fundamental in linear algebra and appears throughout probabilistic modeling, Kalman filtering, and structured matrix computations.
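A quick numerical sanity check of this identity (a small sketch using a randomly generated positive definite covariance; the partition sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

# Build a random symmetric positive definite joint covariance, partitioned 2 + 3
A = rng.standard_normal((5, 5))
Sigma = A @ A.T + 5 * np.eye(5)
Saa, Sab = Sigma[:2, :2], Sigma[:2, 2:]
Sba, Sbb = Sigma[2:, :2], Sigma[2:, 2:]

# Schur complement of Saa
S = Sbb - Sba @ np.linalg.solve(Saa, Sab)

# The bottom-right block of Sigma^{-1} should equal S^{-1}
Sigma_inv = np.linalg.inv(Sigma)
print(np.allclose(Sigma_inv[2:, 2:], np.linalg.inv(S)))  # True
```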
We have so far assumed zero prior mean for simplicity, but the extension to arbitrary mean functions is straightforward.
With prior mean function $m(\mathbf{x})$, the joint distribution becomes:
$$\begin{bmatrix} \mathbf{f} \\ \mathbf{f}_* \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} \mathbf{m} \\ \mathbf{m}_* \end{bmatrix}, \begin{bmatrix} K & K_* \\ K_*^\top & K_{**} \end{bmatrix} \right)$$
where $\mathbf{m} = [m(\mathbf{x}_1), \ldots, m(\mathbf{x}_n)]^\top$, $\mathbf{m}_* = [m(\mathbf{x}_*^{(1)}), \ldots, m(\mathbf{x}_*^{(m)})]^\top$, and we use the shorthand $K = K(\mathbf{X}, \mathbf{X})$, $K_* = K(\mathbf{X}, \mathbf{X}_*)$, $K_{**} = K(\mathbf{X}_*, \mathbf{X}_*)$.
Applying the conditioning formula:
$$\boldsymbol{\mu}_* = \mathbf{m}_* + K_*^\top K^{-1} (\mathbf{y} - \mathbf{m})$$
$$\Sigma_* = K_{**} - K_*^\top K^{-1} K_*$$
Non-zero mean functions are useful when you have prior knowledge about the function's general trend. Common choices include: (1) Constant mean m(x) = c for known baseline levels, (2) Linear mean m(x) = β₀ + β₁ᵀx for known linear trends, (3) Parametric means for domain-specific prior knowledge. The GP then learns deviations from this mean.
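As a sketch of how this extension looks in code (it adapts the `gp_posterior` implementation above; the `gp_posterior_with_mean` name and the `mean_fn` argument are hypothetical additions for illustration), the non-zero-mean posterior only requires subtracting the prior mean from the targets and adding it back at the test points:

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def gp_posterior_with_mean(X_train, y_train, X_test, kernel_fn, mean_fn, jitter=1e-8):
    """GP posterior with an arbitrary prior mean function m(x)."""
    m_train = mean_fn(X_train)   # m
    m_test = mean_fn(X_test)     # m_*

    K = kernel_fn(X_train, X_train) + jitter * np.eye(len(X_train))
    K_star = kernel_fn(X_test, X_train)      # K_*^T in the notation above
    K_star_star = kernel_fn(X_test, X_test)  # K_**

    L = cholesky(K, lower=True)
    # alpha = K^{-1} (y - m)
    alpha = solve_triangular(L.T, solve_triangular(L, y_train - m_train, lower=True), lower=False)

    mu_post = m_test + K_star @ alpha        # m_* + K_*^T K^{-1} (y - m)
    V = solve_triangular(L, K_star.T, lower=True)
    cov_post = K_star_star - V.T @ V         # K_** - K_*^T K^{-1} K_*
    return mu_post, cov_post

# Example: constant prior mean of 5.0 (a hypothetical baseline), reusing earlier data
# mu, cov = gp_posterior_with_mean(X_train, y_train, X_test,
#                                  kernel_fn=rbf_kernel,
#                                  mean_fn=lambda X: np.full(len(X), 5.0))
```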
Practical Implications
The choice of mean function affects predictions, especially in regions far from training data:
| Mean Function | Behavior Far From Data | Use Case |
|---|---|---|
| Zero mean $m(\mathbf{x}) = 0$ | Predictions revert to 0 | When 0 is a reasonable default |
| Constant mean $m(\mathbf{x}) = c$ | Predictions revert to $c$ | Known baseline/average |
| Linear mean $m(\mathbf{x}) = \boldsymbol{\beta}^\top \mathbf{x}$ | Predictions follow linear trend | Known overall trend |
| Prior from simpler model | Combines simple + complex model | Transfer learning |
For many applications, zero mean is acceptable because the kernel already encodes substantial prior knowledge. However, for extrapolation-heavy problems, the mean function becomes critical since the GP reverts to it beyond the training data.
We have derived the complete posterior distribution for Gaussian Process regression. Let us consolidate the key results: the joint prior over training and test function values is Gaussian; conditioning on the observations yields the posterior mean $\boldsymbol{\mu}_* = K(\mathbf{X}_*, \mathbf{X}) K(\mathbf{X}, \mathbf{X})^{-1} \mathbf{y}$ and covariance $\Sigma_* = K(\mathbf{X}_*, \mathbf{X}_*) - K(\mathbf{X}_*, \mathbf{X}) K(\mathbf{X}, \mathbf{X})^{-1} K(\mathbf{X}, \mathbf{X}_*)$; a non-zero prior mean simply shifts the targets and predictions accordingly; and the Cholesky decomposition makes these computations numerically stable.
What's Next
We have derived the posterior distribution, but we haven't yet discussed how to use it for prediction. In the next page, we explore the predictive distribution—how to make probabilistic predictions at new test points and quantify prediction uncertainty. We will also see how the posterior covariance structure enables principled uncertainty quantification that distinguishes GPs from most other regression methods.
You now understand how to derive the Gaussian Process posterior from first principles. This closed-form posterior—unique among flexible non-parametric methods—is what makes GP regression both mathematically elegant and computationally tractable for small to medium datasets.