We've established that a Gaussian Process defines a prior distribution over functions, specified through a mean function and kernel. But priors alone don't make predictions—we need to condition on observed data to update our beliefs.
This is where the magic of Gaussian Processes truly shines. Unlike most machine learning methods that produce point predictions, GP posterior inference yields complete probability distributions over function values. At every new input, we obtain not just a prediction but a full quantification of our uncertainty.
The remarkable fact is that this posterior computation has a closed-form solution. No sampling, no variational approximations, no iterative optimization—just matrix algebra. This analytical tractability is a direct consequence of the Gaussian distribution's closure under conditioning.
By the end of this page, you will understand how to derive the GP posterior distribution through conditioning, the closed-form expressions for posterior mean and variance, how observations "pin down" the function at data points while uncertainty remains elsewhere, and the computational costs and numerical considerations for GP inference.
Suppose we have observed data $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^n$ where $y_i = f(x_i) + \epsilon_i$ with observation noise $\epsilon_i \sim \mathcal{N}(0, \sigma_n^2)$. We want to predict $f(x_*)$ at a new test point $x_*$.
The Bayesian Framework:
The posterior is also a GP, thanks to the Gaussian's closure properties. Our task is to derive the posterior mean and covariance functions.
Setting Up the Joint Distribution:
Consider the joint distribution of the training observations $\mathbf{y} = [y_1, \ldots, y_n]^T$ and the test function value $f_* = f(x_*)$. Under our GP prior:
$$\begin{pmatrix} \mathbf{y} \\ f_* \end{pmatrix} \sim \mathcal{N}\left( \begin{pmatrix} \mathbf{m} \\ m_* \end{pmatrix}, \begin{pmatrix} K + \sigma_n^2 I & \mathbf{k}_* \\ \mathbf{k}_*^T & k_{**} \end{pmatrix} \right)$$
where:

- $K \in \mathbb{R}^{n \times n}$ with $K_{ij} = k(x_i, x_j)$ is the kernel matrix over training inputs
- $\mathbf{k}_* \in \mathbb{R}^n$ with $[\mathbf{k}_*]_i = k(x_i, x_*)$ is the cross-covariance between training inputs and the test point
- $k_{**} = k(x_*, x_*)$ is the prior variance at the test point
- $\mathbf{m} = [m(x_1), \ldots, m(x_n)]^T$ and $m_* = m(x_*)$ are the prior means
The diagonal $\sigma_n^2 I$ is added to $K$ because we observe noisy measurements $y$, not the true function values $f(x)$. This noise "inflates" the covariance of observations. If observations were noise-free, we would condition directly on $f(X)$, not $\mathbf{y}$.
Now we apply the standard formula for conditioning a multivariate Gaussian. Given the joint distribution:
$$\begin{pmatrix} \mathbf{a} \\ \mathbf{b} \end{pmatrix} \sim \mathcal{N}\left( \begin{pmatrix} \boldsymbol{\mu}_a \\ \boldsymbol{\mu}_b \end{pmatrix}, \begin{pmatrix} \Sigma_{aa} & \Sigma_{ab} \\ \Sigma_{ba} & \Sigma_{bb} \end{pmatrix} \right)$$
The conditional distribution of $\mathbf{b}$ given $\mathbf{a}$ is:
$$\mathbf{b} \mid \mathbf{a} \sim \mathcal{N}\left( \boldsymbol{\mu}_b + \Sigma_{ba}\Sigma_{aa}^{-1}(\mathbf{a} - \boldsymbol{\mu}_a),\; \Sigma_{bb} - \Sigma_{ba}\Sigma_{aa}^{-1}\Sigma_{ab} \right)$$
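As a sanity check, the conditioning formula can be evaluated by hand on a two-dimensional Gaussian (the numbers below are arbitrary illustrations, not from the text):

```python
import numpy as np

mu = np.array([1.0, 2.0])        # [mu_a, mu_b]
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])   # [[S_aa, S_ab], [S_ba, S_bb]]
a_obs = 1.5                      # observed value of a

# b | a ~ N(mu_b + S_ba S_aa^{-1} (a - mu_a),  S_bb - S_ba S_aa^{-1} S_ab)
cond_mean = mu[1] + Sigma[1, 0] / Sigma[0, 0] * (a_obs - mu[0])
cond_var = Sigma[1, 1] - Sigma[1, 0] / Sigma[0, 0] * Sigma[0, 1]

print(cond_mean)  # 2.2
print(cond_var)   # 0.68
```

Observing $a$ above its mean pulls the conditional mean of $b$ upward (because the covariance is positive), and the conditional variance shrinks from 1.0 to 0.68 regardless of the observed value.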
Applying to Our GP:
Identifying $\mathbf{a} = \mathbf{y}$ (observed) and $\mathbf{b} = f_*$ (to predict):
$$f_* \mid \mathbf{y}, X, x_* \sim \mathcal{N}(\mu_*, \sigma_*^2)$$
where:
$$\boxed{\mu_* = m(x_*) + \mathbf{k}_*^T (K + \sigma_n^2 I)^{-1} (\mathbf{y} - \mathbf{m})}$$
$$\boxed{\sigma_*^2 = k(x_*, x_*) - \mathbf{k}_*^T (K + \sigma_n^2 I)^{-1} \mathbf{k}_*}$$
These are the fundamental equations of GP regression.
The posterior mean is a linear combination of the observations: $\mu_* = m_* + \mathbf{k}_*^T \boldsymbol{\alpha}$ where $\boldsymbol{\alpha} = (K + \sigma_n^2 I)^{-1}(\mathbf{y} - \mathbf{m})$. The weights depend on how correlated the test point is with each training point ($\mathbf{k}_*$), modulated by the inverse covariance structure. Points far from training data get weights close to zero, reverting the prediction toward the prior mean.
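The boxed equations can be evaluated directly in a few lines. The sketch below assumes a zero prior mean and an RBF kernel, with made-up training data purely for illustration:

```python
import numpy as np

def rbf(A, B, ell=1.0, sig2=1.0):
    """Squared exponential kernel between two 1-D input arrays."""
    d2 = (A[:, None] - B[None, :]) ** 2
    return sig2 * np.exp(-d2 / (2 * ell**2))

X = np.array([-1.0, 0.0, 1.0])   # training inputs (made up)
y = np.array([0.5, 1.0, 0.5])    # noisy observations (made up)
x_star = 0.25                    # test input
noise = 0.1                      # sigma_n^2

K = rbf(X, X) + noise * np.eye(len(X))
k_star = rbf(np.array([x_star]), X).ravel()

# mu_* = k_*^T (K + sigma_n^2 I)^{-1} y        (prior mean m = 0)
alpha = np.linalg.solve(K, y)
mu_star = k_star @ alpha

# sigma_*^2 = k(x_*, x_*) - k_*^T (K + sigma_n^2 I)^{-1} k_*
var_star = rbf(np.array([x_star]), np.array([x_star]))[0, 0] \
           - k_star @ np.linalg.solve(K, k_star)

print(mu_star, var_star)
```

Evaluating the same formulas at a faraway point (say $x = 10$) gives kernel weights that are essentially zero, so the mean reverts to the prior mean of 0 and the variance to the prior variance.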
Understanding the Posterior Variance:
The posterior variance has an elegant interpretation:
$$\sigma_*^2 = \underbrace{k(x_*, x_*)}_{\text{prior variance}} - \underbrace{\mathbf{k}_*^T (K + \sigma_n^2 I)^{-1} \mathbf{k}_*}_{\text{variance reduction}}$$
The variance reduction is always non-negative (since $K + \sigma_n^2 I$ is positive definite), so the posterior variance is always ≤ prior variance. Observations reduce uncertainty—never increase it.
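A quick numerical check of this claim, using an arbitrary random positive definite matrix in place of $K + \sigma_n^2 I$:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
A = rng.standard_normal((n, n))
K_noisy = A @ A.T + 0.1 * np.eye(n)   # positive definite, like K + sigma_n^2 I
k_star = rng.standard_normal(n)

# A quadratic form in the inverse of a positive definite matrix is >= 0,
# so observations can only shrink the prior variance.
reduction = k_star @ np.linalg.solve(K_noisy, k_star)
print(reduction >= 0)  # True
```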
Key properties of the posterior mean and variance:
| Property | Posterior Mean | Posterior Variance |
|---|---|---|
| Depends on observed $\mathbf{y}$? | Yes — linear in $\mathbf{y}$ | No — only depends on $X, x_*$ |
| At training point | Interpolates data (if $\sigma_n = 0$) | Zero if $\sigma_n = 0$; otherwise small but nonzero |
| Far from data | Reverts to prior mean | Reverts to prior variance |
| Complexity | $O(n)$ per test point (after precomputing $\boldsymbol{\alpha}$) | $O(n^2)$ per test point |
Let's implement GP regression from scratch, carefully handling numerical stability and providing clear visualization.
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.linalg import cholesky, solve_triangular


class GaussianProcessRegressor:
    """
    Gaussian Process Regressor with squared exponential kernel.

    Provides posterior mean and variance predictions with
    numerically stable implementations.
    """

    def __init__(self, length_scale=1.0, signal_variance=1.0, noise_variance=0.1):
        """
        Initialize GP hyperparameters.

        Parameters:
        - length_scale: Controls function smoothness
        - signal_variance: Controls function magnitude
        - noise_variance: Observation noise variance
        """
        self.length_scale = length_scale
        self.signal_variance = signal_variance
        self.noise_variance = noise_variance

        # Will be set during fit
        self.X_train = None
        self.y_train = None
        self.L = None      # Cholesky factor
        self.alpha = None  # Precomputed weights

    def kernel(self, X1, X2):
        """
        Squared Exponential kernel.
        k(x, x') = σ² exp(-||x - x'||² / (2ℓ²))
        """
        # Compute squared distances efficiently
        X1 = np.atleast_2d(X1)
        X2 = np.atleast_2d(X2)
        # ||x - x'||² = ||x||² + ||x'||² - 2 x·x'
        X1_sq = np.sum(X1**2, axis=1).reshape(-1, 1)
        X2_sq = np.sum(X2**2, axis=1).reshape(1, -1)
        sq_dist = X1_sq + X2_sq - 2 * X1 @ X2.T
        # Ensure non-negative (numerical issues can cause tiny negatives)
        sq_dist = np.maximum(sq_dist, 0)
        return self.signal_variance * np.exp(-sq_dist / (2 * self.length_scale**2))

    def fit(self, X, y):
        """
        Fit the GP to training data.
        Precomputes quantities needed for prediction.
        """
        self.X_train = np.atleast_2d(X)
        self.y_train = y.ravel()
        n = len(self.y_train)

        # Compute kernel matrix
        K = self.kernel(self.X_train, self.X_train)

        # Add noise variance to diagonal
        K_noisy = K + self.noise_variance * np.eye(n)

        # Cholesky decomposition for numerical stability
        # K_noisy = L @ L.T
        self.L = cholesky(K_noisy, lower=True)

        # Solve L @ z = y for z, then L.T @ alpha = z for alpha
        # Equivalent to alpha = K_noisy^{-1} @ y
        self.alpha = solve_triangular(
            self.L.T,
            solve_triangular(self.L, self.y_train, lower=True),
            lower=False
        )
        return self

    def predict(self, X_test, return_std=True, return_cov=False):
        """
        Predict at test points.

        Returns:
        - mean: Posterior mean at test points
        - std (optional): Posterior standard deviation
        - cov (optional): Full posterior covariance matrix
        """
        X_test = np.atleast_2d(X_test)
        n_test = len(X_test)

        # Cross-covariance between test and training points
        K_star = self.kernel(X_test, self.X_train)

        # Posterior mean: μ* = K* @ (K + σₙ²I)⁻¹ @ y = K* @ α
        mean = K_star @ self.alpha

        if not return_std and not return_cov:
            return mean

        # For variance, we need L⁻¹ @ K*ᵀ
        # Solve L @ v = K*ᵀ for v
        v = solve_triangular(self.L, K_star.T, lower=True)

        if return_cov:
            # Full covariance: K** - K* @ (K + σₙ²I)⁻¹ @ K*ᵀ
            K_test = self.kernel(X_test, X_test)
            cov = K_test - v.T @ v
            return mean, cov

        if return_std:
            # Diagonal only (variance at each point)
            K_diag = self.signal_variance * np.ones(n_test)  # k(x*, x*) = σ²
            var = K_diag - np.sum(v**2, axis=0)
            # Ensure non-negative (numerical stability)
            var = np.maximum(var, 0)
            return mean, np.sqrt(var)

        return mean

    def sample_posterior(self, X_test, n_samples=5):
        """
        Draw samples from the posterior distribution.
        """
        mean, cov = self.predict(X_test, return_std=False, return_cov=True)
        # Add small jitter for numerical stability
        cov += 1e-8 * np.eye(len(X_test))
        # Sample from multivariate normal
        L = cholesky(cov, lower=True)
        samples = mean[:, np.newaxis] + L @ np.random.randn(len(X_test), n_samples)
        return samples


# Demonstration
np.random.seed(42)

# True function (unknown to the model)
def true_function(x):
    return np.sin(x) + 0.5 * np.sin(3*x)

# Generate training data with noise
n_train = 10
X_train = np.random.uniform(-5, 5, n_train).reshape(-1, 1)
y_train = true_function(X_train.ravel()) + 0.2 * np.random.randn(n_train)

# Fit GP
gp = GaussianProcessRegressor(length_scale=1.0, signal_variance=1.0, noise_variance=0.04)
gp.fit(X_train, y_train)

# Predict on dense grid
X_test = np.linspace(-6, 6, 200).reshape(-1, 1)
mean, std = gp.predict(X_test)
samples = gp.sample_posterior(X_test, n_samples=5)

# Visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: Mean and uncertainty
ax = axes[0]
ax.fill_between(X_test.ravel(), mean - 2*std, mean + 2*std,
                alpha=0.2, color='C0', label='95% CI')
ax.fill_between(X_test.ravel(), mean - std, mean + std,
                alpha=0.3, color='C0', label='68% CI')
ax.plot(X_test, mean, 'C0-', linewidth=2, label='Posterior mean')
ax.plot(X_test, true_function(X_test.ravel()), 'k--', linewidth=1.5,
        label='True function', alpha=0.7)
ax.scatter(X_train, y_train, c='red', s=50, zorder=5, label='Training data')
ax.set_xlabel('x')
ax.set_ylabel('f(x)')
ax.set_title('GP Regression: Posterior Mean and Uncertainty')
ax.legend(loc='upper right')
ax.grid(True, alpha=0.3)

# Plot 2: Posterior samples
ax = axes[1]
for i in range(5):
    ax.plot(X_test, samples[:, i], alpha=0.7, linewidth=1.5,
            label=f'Sample {i+1}' if i < 3 else None)
ax.plot(X_test, true_function(X_test.ravel()), 'k--', linewidth=2,
        label='True function', alpha=0.7)
ax.scatter(X_train, y_train, c='red', s=50, zorder=5, label='Training data')
ax.set_xlabel('x')
ax.set_ylabel('f(x)')
ax.set_title('Posterior Samples\n(consistent with data)')
ax.legend(loc='upper right')
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
```

Examining the posterior distribution reveals deep insights about how GPs combine prior knowledge with data.
Insight 1: Interpolation vs. Smoothing
The noise variance $\sigma_n^2$ controls whether the GP interpolates training points (passes exactly through them) or smooths over them:

- As $\sigma_n^2 \to 0$, the posterior mean passes through every observation (interpolation)
- As $\sigma_n^2$ grows, observations are treated as increasingly noisy, and the mean smooths over them rather than fitting each one exactly
Insight 2: Data Density Effects
Where training data is dense, the posterior mean is well-constrained and uncertainty is low. In regions with sparse or no data, predictions revert toward the prior with increased uncertainty.
This is exactly what we want! The GP "knows what it doesn't know"—uncertainty automatically grows where data is lacking.
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.linalg import cholesky, solve_triangular


def gp_posterior(X_train, y_train, X_test, length_scale=1.0, signal_var=1.0, noise_var=0.1):
    """Compute GP posterior mean and std."""
    def kernel(X1, X2):
        X1, X2 = np.atleast_2d(X1), np.atleast_2d(X2)
        sq_dist = np.sum(X1**2, 1).reshape(-1, 1) + np.sum(X2**2, 1) - 2*X1@X2.T
        return signal_var * np.exp(-np.maximum(sq_dist, 0) / (2*length_scale**2))

    K = kernel(X_train, X_train) + noise_var * np.eye(len(X_train))
    K_star = kernel(X_test, X_train)

    L = cholesky(K, lower=True)
    alpha = solve_triangular(L.T, solve_triangular(L, y_train, lower=True))
    v = solve_triangular(L, K_star.T, lower=True)

    mean = K_star @ alpha
    var = signal_var - np.sum(v**2, axis=0)
    return mean, np.sqrt(np.maximum(var, 0))


# Demonstration of key insights
np.random.seed(42)

fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Demo 1: Interpolation vs Smoothing
X_train = np.array([[-2], [0], [2]])
y_train = np.array([1, 0, 1])
X_test = np.linspace(-4, 4, 200).reshape(-1, 1)

for noise_var, label, ax in [(0.001, 'Interpolation (σₙ²≈0)', axes[0, 0]),
                             (0.5, 'Smoothing (σₙ²=0.5)', axes[0, 1])]:
    mean, std = gp_posterior(X_train, y_train, X_test, noise_var=noise_var)
    ax.fill_between(X_test.ravel(), mean-2*std, mean+2*std, alpha=0.2, color='C0')
    ax.plot(X_test, mean, 'C0-', linewidth=2)
    ax.scatter(X_train, y_train, c='red', s=100, zorder=5)
    ax.set_title(label)
    ax.set_xlabel('x')
    ax.set_ylabel('f(x)')
    ax.grid(True, alpha=0.3)

# Demo 2: Data density effects
X_train_sparse = np.array([[-4], [4]])  # Only at edges
y_train_sparse = np.array([1, -1])
mean_sparse, std_sparse = gp_posterior(X_train_sparse, y_train_sparse, X_test,
                                       noise_var=0.01, length_scale=2.0)

ax = axes[1, 0]
ax.fill_between(X_test.ravel(), mean_sparse-2*std_sparse, mean_sparse+2*std_sparse,
                alpha=0.2, color='C0', label='95% CI')
ax.plot(X_test, mean_sparse, 'C0-', linewidth=2, label='Mean')
ax.scatter(X_train_sparse, y_train_sparse, c='red', s=100, zorder=5, label='Data')
ax.axhline(0, color='gray', linestyle='--', alpha=0.5)
ax.set_title('Sparse Data: High Uncertainty in Middle')
ax.set_xlabel('x')
ax.set_ylabel('f(x)')
ax.legend()
ax.grid(True, alpha=0.3)

# Demo 3: Uncertainty only depends on input locations
X_train_same = np.array([[-2], [0], [2]])
y_train_1 = np.array([0, 0, 0])   # Flat
y_train_2 = np.array([2, -1, 2])  # Curved

ax = axes[1, 1]
_, std_1 = gp_posterior(X_train_same, y_train_1, X_test, noise_var=0.01)
_, std_2 = gp_posterior(X_train_same, y_train_2, X_test, noise_var=0.01)

ax.plot(X_test, std_1, 'C0-', linewidth=2, label='y = [0, 0, 0]')
ax.plot(X_test, std_2, 'C1--', linewidth=2, label='y = [2, -1, 2]')
for x in X_train_same:
    ax.axvline(x.item(), color='red', linestyle=':', alpha=0.5)
ax.set_title('Std Dev: Same for Different y-values!')
ax.set_xlabel('x')
ax.set_ylabel('Posterior Std Dev')
ax.legend()
ax.grid(True, alpha=0.3)

plt.suptitle('Key Posterior Insights', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()
```

A remarkable property: the posterior variance depends only on the input locations $X$ and $x_*$, not on the observed values $\mathbf{y}$. This means we can pre-compute uncertainty before seeing target values! This is useful for experimental design—we can identify where to sample next based on uncertainty, without needing to run experiments first.
GP inference involves matrix operations that can be numerically unstable. Understanding these issues is crucial for robust implementations.
The Core Challenge: Matrix Inversion
The posterior equations require $(K + \sigma_n^2 I)^{-1}$. Directly computing matrix inverses is:

- Numerically unstable when the matrix is ill-conditioned, amplifying round-off error
- Wasteful, since we only ever need the inverse applied to specific vectors, never the inverse itself
Solution: Cholesky Decomposition
Instead of inverting, we decompose the matrix: $$K + \sigma_n^2 I = LL^T$$
where $L$ is lower triangular. Then:

- Solve $L\mathbf{z} = \mathbf{y}$ by forward substitution ($O(n^2)$)
- Solve $L^T \boldsymbol{\alpha} = \mathbf{z}$ by back substitution ($O(n^2)$), giving $\boldsymbol{\alpha} = (K + \sigma_n^2 I)^{-1}\mathbf{y}$
This is faster and more numerically stable than explicit inversion.
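A minimal sketch of the two-triangular-solve procedure, using a small random symmetric positive definite matrix as a stand-in for $K + \sigma_n^2 I$, and checking it against a direct solve:

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

rng = np.random.default_rng(1)
n = 6
A = rng.standard_normal((n, n))
K_noisy = A @ A.T + 0.5 * np.eye(n)   # stands in for K + sigma_n^2 I
y = rng.standard_normal(n)

L = cholesky(K_noisy, lower=True)              # K_noisy = L @ L.T
z = solve_triangular(L, y, lower=True)         # forward substitution: L z = y
alpha = solve_triangular(L.T, z, lower=False)  # back substitution: L.T alpha = z

# The two triangular solves reproduce (K + sigma_n^2 I)^{-1} y
print(np.allclose(alpha, np.linalg.solve(K_noisy, y)))  # True
```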
Even with Cholesky, the kernel matrix can become numerically singular when points are very close. Adding small "jitter" to the diagonal (e.g., 1e-8 × I) ensures positive definiteness. This is standard practice and has negligible impact on predictions. Libraries like GPyTorch add this automatically.
For the same reason, predicted variances can come out marginally negative due to round-off; in code they are clamped with `np.maximum(var, 0)` before taking square roots.
```python
import numpy as np
from scipy.linalg import cholesky, cho_solve, LinAlgError


def stable_gp_inference(K, y, noise_var, jitter=1e-8, max_jitter=1e-2):
    """
    Numerically stable GP inference with automatic jitter adjustment.

    Parameters:
    - K: Kernel matrix (n × n)
    - y: Observations (n,)
    - noise_var: Observation noise variance
    - jitter: Initial jitter value
    - max_jitter: Maximum jitter before giving up

    Returns:
    - L: Cholesky factor
    - alpha: (K + noise_var*I)^{-1} @ y
    """
    n = len(y)
    K_noisy = K + noise_var * np.eye(n)

    current_jitter = jitter
    while current_jitter < max_jitter:
        try:
            L = cholesky(K_noisy + current_jitter * np.eye(n), lower=True)
            alpha = cho_solve((L, True), y)
            if current_jitter > jitter:
                print(f"Warning: Required jitter = {current_jitter:.2e}")
            return L, alpha
        except LinAlgError:
            current_jitter *= 10

    raise LinAlgError(f"Cholesky failed even with jitter = {current_jitter}")


def compute_log_marginal_likelihood(L, y, alpha):
    """
    Compute log marginal likelihood (for hyperparameter optimization).

    log p(y|X,θ) = -½ yᵀ(K+σₙ²I)⁻¹y - ½ log|K+σₙ²I| - n/2 log(2π)

    Using Cholesky:
    - yᵀα gives yᵀ(K+σₙ²I)⁻¹y
    - 2 Σ log(L_ii) gives log|K+σₙ²I|
    """
    n = len(y)
    # Data fit term: -½ yᵀ α
    data_fit = -0.5 * y.T @ alpha
    # Complexity term: -½ log|K+σₙ²I| = -Σ log(L_ii)
    complexity = -np.sum(np.log(np.diag(L)))
    # Constant term
    constant = -0.5 * n * np.log(2 * np.pi)
    return data_fit + complexity + constant


# Demonstration
np.random.seed(42)

# Generate data
n = 50
X = np.linspace(0, 10, n).reshape(-1, 1)
y = np.sin(X.ravel()) + 0.1 * np.random.randn(n)

# Build kernel matrix with potential numerical issues
length_scale = 0.1  # Very short length scale can create issues
signal_var = 1.0
sq_dist = (X - X.T)**2
K = signal_var * np.exp(-sq_dist / (2 * length_scale**2))

# Stable inference
L, alpha = stable_gp_inference(K, y, noise_var=0.01)
log_ml = compute_log_marginal_likelihood(L, y, alpha)

print(f"Log marginal likelihood: {log_ml:.2f}")
print(f"Cholesky factor condition number: {np.linalg.cond(L):.2e}")

# Verify positive definiteness
eigenvalues = np.linalg.eigvalsh(K + 0.01 * np.eye(n))
print(f"Smallest eigenvalue: {eigenvalues.min():.2e}")
print(f"Largest eigenvalue: {eigenvalues.max():.2e}")
print(f"Condition number: {eigenvalues.max() / eigenvalues.min():.2e}")
```

So far we've considered predicting at a single test point $x_*$. When predicting at multiple test points $X_* = [x_*^{(1)}, \ldots, x_*^{(m)}]$, we can compute the joint posterior distribution:
$$\mathbf{f}_* \mid \mathbf{y}, X, X_* \sim \mathcal{N}(\boldsymbol{\mu}_*, \Sigma_*)$$
where:
$$\boldsymbol{\mu}_* = K_*^T (K + \sigma_n^2 I)^{-1} \mathbf{y}$$
$$\Sigma_* = K_{**} - K_*^T (K + \sigma_n^2 I)^{-1} K_*$$
Here:

- $K_* \in \mathbb{R}^{n \times m}$ holds the cross-covariances between training and test inputs
- $K_{**} \in \mathbb{R}^{m \times m}$ holds the covariances among the test inputs
- The mean expression above assumes a zero prior mean for brevity
Why Joint Predictions Matter:
The joint posterior captures correlations between test predictions. If two test points are nearby, their function values are correlated—knowing $f(x_*^{(1)})$ tells us something about $f(x_*^{(2)})$.
Individual marginal variances don't capture this information, but the joint covariance $\Sigma_*$ does. This is crucial for:

- Drawing coherent function samples from the posterior
- Any downstream computation that combines predictions at several inputs, where prediction errors are correlated
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.linalg import cholesky, solve_triangular


def gp_joint_posterior(X_train, y_train, X_test, length_scale=1.0, signal_var=1.0, noise_var=0.1):
    """
    Compute joint posterior distribution over test points.
    Returns mean vector and full covariance matrix.
    """
    def kernel(X1, X2):
        X1, X2 = np.atleast_2d(X1), np.atleast_2d(X2)
        sq_dist = np.sum(X1**2, 1).reshape(-1, 1) + np.sum(X2**2, 1) - 2*X1@X2.T
        return signal_var * np.exp(-np.maximum(sq_dist, 0) / (2*length_scale**2))

    # Compute all kernel matrices
    K = kernel(X_train, X_train) + noise_var * np.eye(len(X_train))
    K_star = kernel(X_test, X_train)      # m × n
    K_starstar = kernel(X_test, X_test)   # m × m

    # Cholesky decomposition
    L = cholesky(K, lower=True)

    # alpha = K^{-1} y
    alpha = solve_triangular(L.T, solve_triangular(L, y_train, lower=True))

    # v = L^{-1} K_star^T
    v = solve_triangular(L, K_star.T, lower=True)

    # Posterior mean and covariance
    mean = K_star @ alpha
    cov = K_starstar - v.T @ v

    # Ensure symmetric (numerical fix)
    cov = 0.5 * (cov + cov.T)
    return mean, cov


# Demonstration
np.random.seed(42)

# Training data
X_train = np.array([[-3], [0], [3]])
y_train = np.array([1, 0, 1])

# Test points
X_test = np.linspace(-5, 5, 100).reshape(-1, 1)

# Compute joint posterior
mean, cov = gp_joint_posterior(X_train, y_train, X_test, length_scale=1.5, noise_var=0.01)

# Sample from joint distribution
n_samples = 50
L_post = cholesky(cov + 1e-8 * np.eye(len(X_test)), lower=True)
samples = mean[:, np.newaxis] + L_post @ np.random.randn(len(X_test), n_samples)

# Also generate "independent" samples (ignoring correlation)
std = np.sqrt(np.diag(cov))
independent_samples = mean[:, np.newaxis] + std[:, np.newaxis] * np.random.randn(len(X_test), n_samples)

# Visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Joint samples (smooth, correlated)
ax = axes[0]
for i in range(10):
    ax.plot(X_test, samples[:, i], alpha=0.5, color='C0')
ax.scatter(X_train, y_train, c='red', s=100, zorder=5)
ax.set_title('Joint Posterior Samples\n(Correlated, smooth functions)')
ax.set_xlabel('x')
ax.set_ylabel('f(x)')
ax.grid(True, alpha=0.3)

# Independent samples (jagged, not functions)
ax = axes[1]
for i in range(10):
    ax.plot(X_test, independent_samples[:, i], alpha=0.5, color='C1')
ax.scatter(X_train, y_train, c='red', s=100, zorder=5)
ax.set_title('Independent Marginal Samples\n(Uncorrelated, not proper functions!)')
ax.set_xlabel('x')
ax.set_ylabel('f(x)')
ax.grid(True, alpha=0.3)

plt.suptitle('Joint vs Marginal Sampling', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

# Show correlation structure
fig, ax = plt.subplots(figsize=(8, 6))
im = ax.imshow(cov, cmap='RdBu_r', vmin=-1, vmax=1)
plt.colorbar(im, ax=ax, label='Covariance')
ax.set_title('Posterior Covariance Matrix')
ax.set_xlabel('Test Point Index')
ax.set_ylabel('Test Point Index')
plt.show()
```

When visualizing posterior uncertainty, always sample from the joint distribution, not independent marginals. Independent marginal samples ignore correlations and produce jagged, non-function-like curves. Joint samples respect the GP's correlation structure and produce smooth functions consistent with the kernel.
Understanding GP computational complexity is essential for practical applications.
Training (Fitting):
The dominant cost is Cholesky decomposition of the $n \times n$ kernel matrix:

- Time: $O(n^3)$ for the factorization
- Memory: $O(n^2)$ to store the kernel matrix and its Cholesky factor
This cubic scaling is the primary limitation of exact GPs. For $n = 10,000$ points, you're decomposing a $10^8$-element matrix—feasible but slow. For $n = 100,000$, standard exact GPs become impractical.
Prediction (at $m$ test points):
Once Cholesky is computed:

- Posterior mean at one test point: $O(n)$ (a dot product with the precomputed $\boldsymbol{\alpha}$)
- Posterior variance at one test point: $O(n^2)$ (one triangular solve against $\mathbf{k}_*$)
| Operation | Time Complexity | Memory | Bottleneck |
|---|---|---|---|
| Build kernel matrix | $O(n^2)$ | $O(n^2)$ | Pairwise distances |
| Cholesky decomposition | $O(n^3)$ | $O(n^2)$ | Matrix factorization |
| Solve for $\alpha$ | $O(n^2)$ | $O(n)$ | Triangular solves |
| Predict mean (m points) | $O(mn)$ | $O(mn)$ | Cross-covariance |
| Predict variance (m points) | $O(mn^2)$ | $O(mn)$ | Triangular solves |
| Full posterior covariance | $O(mn^2 + m^2n)$ | $O(m^2)$ | Dense products |
| Sample from posterior | $O(m^3)$ | $O(m^2)$ | Test Cholesky |
The $O(n^3)$ barrier has motivated extensive research into scalable GP approximations: inducing point methods (FITC, VFE), random Fourier features, structured kernels, and GPU acceleration. We'll cover these in the 'Computational Aspects' module. For now, understand that exact GPs work well for thousands of points; beyond that requires approximations.
We've derived and implemented the core equations of Gaussian Process regression, understanding how conditioning on data transforms prior beliefs into predictions with calibrated uncertainty.
What's Next:
With posterior inference mastered, we'll focus on mean and variance prediction—the practical outputs of GP regression. We'll examine how to interpret predictions, visualize uncertainty, and use GP outputs for downstream tasks like decision-making and active learning.
You now understand the mathematical heart of Gaussian Process regression: conditioning on data to obtain posterior distributions. The closed-form solutions make GPs remarkably elegant and computationally tractable for moderate-sized datasets. Next, we'll explore how to interpret and use these predictions effectively.