A Gaussian Process prior is an abstract mathematical object—a distribution over an infinite-dimensional function space. How can we understand what it actually represents? The answer is sampling.
By drawing samples from a GP prior, we generate concrete functions that are 'typical' according to our prior beliefs. These samples reveal the smoothness, amplitude, and characteristic length scale your prior encodes, properties that are otherwise hidden inside the kernel's formula.
The fundamental insight: Before fitting any data, you should sample from your prior and ask: 'Do these functions look like reasonable explanations for my data?' If not, your prior is misspecified, and it's time to revise your kernel choices.
By the end of this page, you will master the mathematical and computational methods for sampling from GP priors, understand how to use prior samples for model criticism and hyperparameter intuition, and develop a practical workflow for validating GP models before training.
Sampling from a GP at a finite set of points reduces to sampling from a multivariate Gaussian. Here's the complete procedure:
Given: GP with mean function $m(\mathbf{x})$ and kernel $k(\mathbf{x}, \mathbf{x}')$
Goal: Sample function values at points $\mathbf{X} = \{\mathbf{x}_1, \ldots, \mathbf{x}_n\}$
Algorithm:
1. Compute the mean vector: $\boldsymbol{\mu} = [m(\mathbf{x}_1), \ldots, m(\mathbf{x}_n)]^\top$
2. Compute the covariance matrix: $K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)$
3. Compute the Cholesky decomposition: $\mathbf{K} = \mathbf{L} \mathbf{L}^\top$ where $\mathbf{L}$ is lower-triangular
4. Sample a standard Gaussian vector: $\mathbf{z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$
5. Transform: $\mathbf{f} = \boldsymbol{\mu} + \mathbf{L} \mathbf{z}$
The resulting $\mathbf{f}$ is a sample from $\mathcal{N}(\boldsymbol{\mu}, \mathbf{K})$—exactly the GP evaluated at points $\mathbf{X}$.
Why Cholesky?
The Cholesky decomposition $\mathbf{K} = \mathbf{L} \mathbf{L}^\top$ provides a 'square root' of the covariance matrix. If $\mathbf{z}$ has identity covariance, then:
$$\text{Cov}[\mathbf{L} \mathbf{z}] = \mathbf{L} \text{Cov}[\mathbf{z}] \mathbf{L}^\top = \mathbf{L} \mathbf{I} \mathbf{L}^\top = \mathbf{K}$$
So $\mathbf{L} \mathbf{z}$ has exactly the covariance we need.
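As a quick sanity check, the snippet below (a minimal illustrative sketch; the grid and kernel are arbitrary choices, not part of the lesson code) estimates the covariance of $\mathbf{L}\mathbf{z}$ from many draws and compares it to $\mathbf{K}$:

```python
import numpy as np
from scipy.linalg import cholesky

rng = np.random.default_rng(0)

# Illustrative RBF kernel on a small grid
X = np.linspace(0, 5, 20)
K = np.exp(-0.5 * np.subtract.outer(X, X)**2) + 1e-8 * np.eye(len(X))
L = cholesky(K, lower=True)

# Draw many samples f = L z and estimate their covariance empirically
Z = rng.standard_normal((len(X), 100_000))
F = L @ Z
K_hat = np.cov(F)   # rows are grid points, columns are draws

print("Max deviation from K:", np.abs(K_hat - K).max())  # small; shrinks with more draws
```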
Numerical Considerations: In practice, add a small 'jitter' term (e.g., $10^{-8}$) to the diagonal of $\mathbf{K}$ before factorizing. Round-off error often makes kernel matrices numerically indefinite even though they are positive semi-definite in exact arithmetic, and the jitter keeps the Cholesky decomposition stable without visibly changing the samples.
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.linalg import cholesky


class GPPrior:
    """
    A Gaussian Process prior for sampling and visualization.
    """

    def __init__(self, mean_fn, kernel_fn, jitter=1e-8):
        """
        Args:
            mean_fn: m(X) -> array of mean values
            kernel_fn: k(X1, X2) -> covariance matrix
            jitter: small value added to diagonal for numerical stability
        """
        self.mean_fn = mean_fn
        self.kernel_fn = kernel_fn
        self.jitter = jitter

    def sample(self, X, n_samples=1, seed=None):
        """
        Sample function values from GP prior at points X.

        Args:
            X: (n,) or (n, d) array of input points
            n_samples: number of independent samples to draw
            seed: random seed for reproducibility

        Returns:
            (n_samples, n) array of function values
        """
        if seed is not None:
            np.random.seed(seed)

        n = len(X)

        # Step 1: Compute mean vector
        mu = self.mean_fn(X)

        # Step 2: Compute covariance matrix with jitter
        K = self.kernel_fn(X, X) + self.jitter * np.eye(n)

        # Step 3: Cholesky decomposition
        L = cholesky(K, lower=True)

        # Step 4 & 5: Sample and transform
        samples = []
        for _ in range(n_samples):
            z = np.random.randn(n)   # Standard Gaussian
            f = mu + L @ z           # Transform to GP sample
            samples.append(f)

        return np.array(samples)


# Define kernels
def zero_mean(X):
    return np.zeros(len(X))


def rbf_kernel(X1, X2, length_scale=1.0, variance=1.0):
    sqdist = np.subtract.outer(X1, X2)**2
    return variance * np.exp(-0.5 * sqdist / length_scale**2)


# Create GP prior
gp = GPPrior(
    mean_fn=zero_mean,
    kernel_fn=lambda X1, X2: rbf_kernel(X1, X2, length_scale=1.0, variance=1.0)
)

# Sample at many points for smooth visualization
X = np.linspace(0, 10, 200)
samples = gp.sample(X, n_samples=5, seed=42)

# Visualize
plt.figure(figsize=(12, 6))
for i, sample in enumerate(samples):
    plt.plot(X, sample, label=f'Sample {i+1}', alpha=0.8)

# Show mean and ±2σ region
mu = zero_mean(X)
K = rbf_kernel(X, X, length_scale=1.0, variance=1.0)
std = np.sqrt(np.diag(K))
plt.fill_between(X, mu - 2*std, mu + 2*std, alpha=0.15, color='gray',
                 label='±2σ region')
plt.plot(X, mu, 'k--', alpha=0.5, label='Prior mean')

plt.xlabel('x', fontsize=12)
plt.ylabel('f(x)', fontsize=12)
plt.title('Samples from GP Prior with RBF Kernel (ℓ=1.0)', fontsize=14)
plt.legend(loc='upper right')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print(f"Sampled {len(samples)} functions at {len(X)} points each.")
print("Each curve is one 'draw' from the GP prior distribution.")
```

Prior samples provide direct visual intuition for what your GP believes before seeing any data. Here's how to interpret them:
Sample Smoothness: Are the curves as smooth or as rough as you expect the true function to be? Smoothness is set by the kernel family (e.g., RBF versus Matérn).
Sample Amplitude: Do the curves swing over a range comparable to your data? Amplitude is governed by the signal variance $\sigma_f^2$.
Sample Length Scale: Do features rise and fall over distances that match your problem? This is governed by the length scale $\ell$.
This process of sampling from the prior and visually checking is called 'prior predictive checking.' It's a cornerstone of Bayesian workflow—validate that your assumptions are reasonable before conditioning on data. If samples look implausible, revise your kernel or hyperparameters. The data should narrow down an already-sensible prior, not rescue a bad one.
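As a minimal numerical companion to the visual check (a sketch only; `y_obs` and the kernel settings below are hypothetical stand-ins for your data and prior), you can compare the spread your prior assigns to function values against the spread of your observations:

```python
import numpy as np
from scipy.linalg import cholesky

rng = np.random.default_rng(1)

# Hypothetical observations you intend to model
y_obs = np.array([0.4, -1.1, 0.9, 2.3, -0.5])

# Prior samples at the observation locations (RBF kernel, unit signal variance)
X = np.linspace(0, 10, len(y_obs))
K = np.exp(-0.5 * np.subtract.outer(X, X)**2 / 2.0**2) + 1e-8 * np.eye(len(X))
L = cholesky(K, lower=True)
prior_draws = L @ rng.standard_normal((len(X), 1000))

# Crude prior predictive check: does the data range look typical under the prior?
print("Data range:  ", y_obs.min(), "to", y_obs.max())
print("Prior 2.5-97.5% interval roughly",
      np.percentile(prior_draws, 2.5), "to", np.percentile(prior_draws, 97.5))
```

If the data range falls far outside the prior's typical range, that is a sign the signal variance or length scale needs revisiting before any fitting.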
Each kernel hyperparameter affects samples in specific, predictable ways. Understanding these effects helps you set sensible initial values and interpret learned hyperparameters.
Signal Variance $\sigma_f^2$:
This scales the overall amplitude of variation. Mathematically, $k(\mathbf{x}, \mathbf{x}) = \sigma_f^2$, so each marginal $f(\mathbf{x})$ has standard deviation $\sigma_f$ and samples typically stay within roughly $\pm 2\sigma_f$ of the prior mean.
Length Scale $\ell$:
This controls the 'correlation distance'—roughly how far you can extrapolate before uncertainty increases. Small $\ell$ produces rapidly varying, wiggly samples; large $\ell$ produces slowly varying, nearly smooth samples.
Rule of thumb: set $\ell$ to the distance over which you expect significant variation in your function.
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.linalg import cholesky


def rbf_kernel(X1, X2, length_scale=1.0, variance=1.0):
    sqdist = np.subtract.outer(X1, X2)**2
    return variance * np.exp(-0.5 * sqdist / length_scale**2)


def sample_gp(X, kernel_fn, n_samples=3, seed=42):
    np.random.seed(seed)
    K = kernel_fn(X, X) + 1e-8 * np.eye(len(X))
    L = cholesky(K, lower=True)
    return np.array([L @ np.random.randn(len(X)) for _ in range(n_samples)])


X = np.linspace(0, 10, 200)

fig, axes = plt.subplots(2, 3, figsize=(15, 8))

# Vary length scale (top row)
length_scales = [0.3, 1.0, 3.0]
for idx, ls in enumerate(length_scales):
    ax = axes[0, idx]
    samples = sample_gp(X, lambda X1, X2: rbf_kernel(X1, X2, ls, 1.0))
    for s in samples:
        ax.plot(X, s, alpha=0.8)
    ax.set_title(f'Length Scale ℓ = {ls}', fontsize=12)
    ax.set_xlabel('x')
    ax.set_ylabel('f(x)')
    ax.set_ylim([-4, 4])
    ax.grid(True, alpha=0.3)

# Vary signal variance (bottom row)
variances = [0.25, 1.0, 4.0]
for idx, var in enumerate(variances):
    ax = axes[1, idx]
    samples = sample_gp(X, lambda X1, X2: rbf_kernel(X1, X2, 1.0, var))
    for s in samples:
        ax.plot(X, s, alpha=0.8)
    ax.set_title(f'Signal Variance σ² = {var}', fontsize=12)
    ax.set_xlabel('x')
    ax.set_ylabel('f(x)')
    ax.set_ylim([-6, 6])
    ax.grid(True, alpha=0.3)

    # Show ±2σ region
    std = np.sqrt(var)
    ax.axhline(y=2*std, color='r', linestyle='--', alpha=0.5)
    ax.axhline(y=-2*std, color='r', linestyle='--', alpha=0.5)

plt.suptitle('Effect of RBF Kernel Hyperparameters on Prior Samples', fontsize=14)
plt.tight_layout()
plt.show()

print("Top row: Length scale affects 'wiggliness' (smaller = more wiggly)")
print("Bottom row: Signal variance affects amplitude (red lines show ±2σ)")
```

| Hyperparameter | Effect When Increased | Effect When Decreased |
|---|---|---|
| Length scale $\ell$ | Smoother, slower variation | Rougher, faster variation |
| Signal variance $\sigma_f^2$ | Larger amplitude swings | Smaller amplitude swings |
| Matérn $\nu$ | Smoother samples | Rougher samples |
| Period $p$ | Longer repetition cycle | Shorter repetition cycle |
| ARD length scales | Dimension becomes irrelevant | Dimension becomes highly relevant |
When you combine kernels through addition or multiplication, the resulting samples reflect the combined properties. Visualizing these samples is crucial for understanding what complex kernels encode.
Addition ($k = k_1 + k_2$):
Samples from $k = k_1 + k_2$ are distributed as sums of independent samples from $k_1$ and $k_2$: if $f_1 \sim \mathcal{GP}(0, k_1)$ and $f_2 \sim \mathcal{GP}(0, k_2)$ are independent, then $f_1 + f_2 \sim \mathcal{GP}(0, k_1 + k_2)$.
Interpretation: Samples exhibit features from both components superimposed.
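A small empirical check of this fact (illustrative code, not part of the lesson's examples): add independent draws from $k_1$ and $k_2$ and confirm that their covariance matches $k_1 + k_2$.

```python
import numpy as np
from scipy.linalg import cholesky

rng = np.random.default_rng(2)
X = np.linspace(0, 10, 30)

k1 = np.exp(-0.5 * np.subtract.outer(X, X)**2 / 0.5**2)        # short length scale
k2 = 0.5 * np.exp(-0.5 * np.subtract.outer(X, X)**2 / 3.0**2)  # long length scale, smaller variance
jitter = 1e-8 * np.eye(len(X))

L1 = cholesky(k1 + jitter, lower=True)
L2 = cholesky(k2 + jitter, lower=True)

# f = f1 + f2 with f1 ~ GP(0, k1), f2 ~ GP(0, k2) independent
n_draws = 200_000
F = L1 @ rng.standard_normal((len(X), n_draws)) + L2 @ rng.standard_normal((len(X), n_draws))

# Should print True (up to Monte Carlo error)
print("Empirical cov close to k1 + k2:", np.allclose(np.cov(F), k1 + k2, atol=0.05))
```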
Multiplication ($k = k_1 \cdot k_2$):
Multiplication is more complex—a sample from $k_1 \cdot k_2$ is not the product of independent samples from $k_1$ and $k_2$. Instead, the product kernel only allows strong correlation where both components permit it, so one component effectively modulates the other.
Example: Periodic × RBF creates periodic functions that decay in amplitude.
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.linalg import cholesky


def rbf_kernel(X1, X2, ls=1.0, var=1.0):
    return var * np.exp(-0.5 * np.subtract.outer(X1, X2)**2 / ls**2)


def periodic_kernel(X1, X2, period=1.0, ls=1.0, var=1.0):
    diff = np.subtract.outer(X1, X2)
    return var * np.exp(-2 * np.sin(np.pi * np.abs(diff) / period)**2 / ls**2)


def linear_kernel(X1, X2, var=1.0, bias=1.0):
    return bias + var * np.outer(X1, X2)


def sample_gp(X, K_fn, n_samples=3, seed=42):
    np.random.seed(seed)
    K = K_fn(X, X) + 1e-8 * np.eye(len(X))
    L = cholesky(K, lower=True)
    return np.array([L @ np.random.randn(len(X)) for _ in range(n_samples)])


X = np.linspace(0, 15, 300)

fig, axes = plt.subplots(2, 3, figsize=(15, 8))

# Row 1: Component kernels
kernels_1 = [
    ("RBF (ℓ=2)", lambda X1, X2: rbf_kernel(X1, X2, ls=2.0)),
    ("Periodic (p=3)", lambda X1, X2: periodic_kernel(X1, X2, period=3.0, ls=0.5)),
    ("Linear", lambda X1, X2: linear_kernel(X1, X2, var=0.1, bias=0.1)),
]

for idx, (name, k_fn) in enumerate(kernels_1):
    samples = sample_gp(X, k_fn)
    ax = axes[0, idx]
    for s in samples:
        ax.plot(X, s, alpha=0.8)
    ax.set_title(name, fontsize=12)
    ax.set_xlabel('x')
    ax.set_ylabel('f(x)')
    ax.grid(True, alpha=0.3)

# Row 2: Composed kernels
kernels_2 = [
    ("RBF + Periodic",
     lambda X1, X2: rbf_kernel(X1, X2, ls=4.0)
                    + periodic_kernel(X1, X2, period=3.0, ls=0.5, var=0.5)),
    ("Periodic × RBF (decay)",
     lambda X1, X2: periodic_kernel(X1, X2, period=2.0, ls=0.5)
                    * rbf_kernel(X1, X2, ls=5.0)),
    ("Linear × Periodic",
     lambda X1, X2: linear_kernel(X1, X2, var=0.05, bias=0.01)
                    * periodic_kernel(X1, X2, period=2.0, ls=0.5)),
]

for idx, (name, k_fn) in enumerate(kernels_2):
    samples = sample_gp(X, k_fn)
    ax = axes[1, idx]
    for s in samples:
        ax.plot(X, s, alpha=0.8)
    ax.set_title(name, fontsize=12)
    ax.set_xlabel('x')
    ax.set_ylabel('f(x)')
    ax.grid(True, alpha=0.3)

plt.suptitle('Kernel Composition: Components (top) and Combinations (bottom)', fontsize=14)
plt.tight_layout()
plt.show()

print("Addition superposes components; multiplication creates interaction.")
```

Addition: 'My function is the sum of component A and component B.' Use for independent, decomposable structure (trend + seasonal + noise).
Multiplication: 'Component A modulates how component B behaves.' Use for interacting effects (decaying periodicity, varying variance, growing/shrinking patterns).
For scientific work and debugging, reproducible sampling is essential. Here's how to ensure consistent results.
Random Seed Setting:
```python
np.random.seed(42)  # Before any sampling
samples = gp.sample(X, n_samples=5)
```
With a fixed seed, the same samples are generated every time. This enables reproducible figures, easier debugging, and fair comparisons between kernels, since each kernel can be judged on the same underlying random draws.
Generating a Family of Diverse Samples:
```python
for trial in range(10):
    np.random.seed(trial)  # Different seed each trial
    samples = gp.sample(X)
    # Process results...
```
Thread Safety: In parallel code, be careful—multiple threads sharing a random state can produce non-reproducible results. Use thread-local random generators or explicitly pass random state objects.
Always set and record your random seed at the start of experiments. Include it in supplementary materials for papers. Many subtle bugs and 'irreproducible results' stem from forgotten or inconsistent random seeds. Modern libraries (NumPy ≥1.17, PyTorch) offer explicit random state management via Generator objects—prefer these over global state.
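For instance, a minimal sketch of the Generator-based style (the helper `sample_with_rng` is hypothetical, not part of the `GPPrior` class above):

```python
import numpy as np
from scipy.linalg import cholesky

def sample_with_rng(X, kernel_fn, rng, n_samples=3, jitter=1e-8):
    """Draw GP prior samples using an explicit Generator instead of global state."""
    K = kernel_fn(X, X) + jitter * np.eye(len(X))
    L = cholesky(K, lower=True)
    return L @ rng.standard_normal((len(X), n_samples))

rbf = lambda X1, X2: np.exp(-0.5 * np.subtract.outer(X1, X2)**2)
X = np.linspace(0, 10, 100)

samples_a = sample_with_rng(X, rbf, np.random.default_rng(42))
samples_b = sample_with_rng(X, rbf, np.random.default_rng(42))
print("Same seed, same draws:", np.allclose(samples_a, samples_b))  # True
```

Because each call owns its Generator, parallel workers can be given independent seeds without sharing global state.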
Sampling from a GP at $n$ points requires $O(n^3)$ time (for Cholesky) and $O(n^2)$ memory (for the kernel matrix). This limits the number of visualization points.
Practical Limits: On a typical workstation, a few hundred to a few thousand points is comfortable. Beyond roughly $10^4$ points, the $O(n^3)$ Cholesky factorization and the $O(n^2)$ kernel matrix (hundreds of megabytes in double precision) become noticeably expensive.
Strategies for Dense Visualization:
Subsample then interpolate: Sample at sparse points, use GP posterior mean to fill in (but this changes the character of samples)
Iterative refinement: Start with coarse grid, add points where samples vary rapidly
Approximate methods: for example, random Fourier features (RFF) replace the exact kernel with $m$ random basis functions, making dense grids cheap to sample (see the RFF row in the complexity table below and the sketch after this list)
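Here is a rough sketch of the random Fourier feature idea for a one-dimensional RBF kernel (the function and parameter names are illustrative, not from the lesson code): draw $m$ random frequencies from the kernel's spectral density, build an $n \times m$ feature matrix, and sample weights from a standard Gaussian.

```python
import numpy as np
import matplotlib.pyplot as plt

def rff_prior_sample(X, rng, length_scale=1.0, variance=1.0, n_features=500):
    """Approximate sample from an RBF-kernel GP prior via random Fourier features (1D inputs)."""
    omega = rng.normal(0.0, 1.0 / length_scale, size=n_features)   # spectral frequencies
    phase = rng.uniform(0.0, 2 * np.pi, size=n_features)           # random phases
    # Feature map: phi_i(x) = sqrt(2 * variance / m) * cos(omega_i * x + phase_i)
    Phi = np.sqrt(2.0 * variance / n_features) * np.cos(np.outer(X, omega) + phase)
    w = rng.standard_normal(n_features)                            # weight-space sample
    return Phi @ w                                                  # approximate f(X)

rng = np.random.default_rng(0)
X = np.linspace(0, 10, 20_000)   # far denser than an exact Cholesky would handle comfortably
f = rff_prior_sample(X, rng, length_scale=1.0, variance=1.0)

plt.plot(X, f, lw=1)
plt.title('Approximate GP prior sample via random Fourier features')
plt.xlabel('x')
plt.ylabel('f(x)')
plt.show()
```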
Efficient Sampling with Multiple Independent Samples:
Once you've computed the Cholesky decomposition $\mathbf{L}$, drawing additional samples is cheap—just $O(n^2)$ per sample:
```python
L = cholesky(K, lower=True)                              # O(n³) once
samples = [L @ np.random.randn(n) for _ in range(100)]   # O(n²) each
```
Caching: If you're exploring hyperparameters interactively, cache Cholesky factors for kernels you've already computed.
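One lightweight way to do this (a hypothetical sketch; it assumes the evaluation grid `X` is fixed, so only the hyperparameters enter the cache key):

```python
import numpy as np
from scipy.linalg import cholesky

_chol_cache = {}

def cached_cholesky(X, length_scale, variance, jitter=1e-8):
    """Return the Cholesky factor of an RBF kernel on X, reusing earlier computations."""
    key = (round(length_scale, 12), round(variance, 12), len(X))
    if key not in _chol_cache:
        K = variance * np.exp(-0.5 * np.subtract.outer(X, X)**2 / length_scale**2)
        _chol_cache[key] = cholesky(K + jitter * np.eye(len(X)), lower=True)
    return _chol_cache[key]

X = np.linspace(0, 10, 500)
L = cached_cholesky(X, 1.0, 1.0)   # computed once: O(n³)
L = cached_cholesky(X, 1.0, 1.0)   # reused from cache
samples = L @ np.random.default_rng(0).standard_normal((len(X), 10))  # cheap draws
```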
Numerical Stability: Near-singular kernel matrices (when points are very close together or the kernel is ill-conditioned) can cause Cholesky to fail. Solutions include adding a larger jitter to the diagonal, removing near-duplicate input points, or falling back to an eigendecomposition-based matrix square root; a retry-with-growing-jitter helper is sketched below.
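A common defensive pattern (a sketch, under the assumption that slightly inflating the diagonal is acceptable for your application) is to retry the factorization with progressively larger jitter:

```python
import numpy as np
from scipy.linalg import cholesky

def safe_cholesky(K, initial_jitter=1e-10, max_jitter=1e-2):
    """Attempt Cholesky with growing diagonal jitter until it succeeds."""
    jitter = initial_jitter
    while jitter <= max_jitter:
        try:
            return cholesky(K + jitter * np.eye(len(K)), lower=True)
        except np.linalg.LinAlgError:
            jitter *= 10   # escalate and retry
    raise np.linalg.LinAlgError("Cholesky failed even with maximum jitter")

# Illustrative use on a nearly singular kernel matrix (two almost identical points)
X = np.array([0.0, 1e-9, 5.0])
K = np.exp(-0.5 * np.subtract.outer(X, X)**2)
L = safe_cholesky(K)
```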
| Operation | Time Complexity | Memory |
|---|---|---|
| Kernel matrix construction | $O(n^2)$ | $O(n^2)$ |
| Cholesky decomposition | $O(n^3)$ | $O(n^2)$ |
| Single sample (given L) | $O(n^2)$ | $O(n)$ |
| M samples (given L) | $O(Mn^2)$ | $O(n^2 + Mn)$ |
| RFF approximation | $O(nm^2)$ | $O(nm)$ |
In many applications, we need to model multiple correlated outputs simultaneously. Multi-output GPs (MOGPs) extend the basic GP framework to handle this.
The Setup: We have $P$ output functions $f_1, \ldots, f_P$, each evaluated at potentially different input sets. The full output vector is:
$$\mathbf{f} = [f_1(\mathbf{X}_1), f_2(\mathbf{X}_2), \ldots, f_P(\mathbf{X}_P)]$$
The Multi-Output Kernel: A kernel for MOGPs specifies covariances between outputs at different inputs:
$$k((\mathbf{x}, p), (\mathbf{x}', p')) = k_{\text{input}}(\mathbf{x}, \mathbf{x}') \cdot k_{\text{output}}(p, p')$$
The simplest form is the Intrinsic Coregionalization Model (ICM): $$\mathbf{K}_{\text{MOGP}} = \mathbf{B} \otimes \mathbf{K}_{\text{input}}$$
where $\mathbf{B}$ is a $P \times P$ positive semi-definite matrix encoding output correlations, and $\otimes$ is the Kronecker product.
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.linalg import cholesky


def rbf_kernel(X1, X2, ls=1.0, var=1.0):
    return var * np.exp(-0.5 * np.subtract.outer(X1, X2)**2 / ls**2)


def sample_mogp(X, K_input, B, n_samples=3, seed=42):
    """
    Sample from Multi-Output GP using the ICM model.

    K_MOGP = B ⊗ K_input

    Args:
        X: input points (n,)
        K_input: input kernel matrix (n, n)
        B: output covariance matrix (P, P)

    Returns:
        (n_samples, P, n) array of function values
    """
    np.random.seed(seed)
    n = len(X)
    P = B.shape[0]

    # Build full covariance via Kronecker product
    K_full = np.kron(B, K_input) + 1e-8 * np.eye(P * n)
    L = cholesky(K_full, lower=True)

    samples = []
    for _ in range(n_samples):
        z = np.random.randn(P * n)
        f = L @ z
        f = f.reshape(P, n)   # Reshape to (outputs, inputs)
        samples.append(f)

    return np.array(samples)


# Define correlated outputs
X = np.linspace(0, 10, 100)
K_input = rbf_kernel(X, X, ls=2.0)

# Output correlation matrix (3 outputs)
# Positive correlations between 1-2, negative between 1-3
B = np.array([
    [1.0, 0.7, -0.5],
    [0.7, 1.0, 0.2],
    [-0.5, 0.2, 1.0]
])

# Sample
samples = sample_mogp(X, K_input, B, n_samples=3)

# Visualize
fig, axes = plt.subplots(3, 1, figsize=(12, 8), sharex=True)
colors = ['tab:blue', 'tab:orange', 'tab:green']
labels = ['Output 1', 'Output 2 (corr. with 1)', 'Output 3 (anti-corr. with 1)']

for sample_idx in range(3):
    sample = samples[sample_idx]
    for output_idx in range(3):
        ax = axes[output_idx]
        alpha = 0.9 if sample_idx == 0 else 0.5
        ax.plot(X, sample[output_idx], color=colors[output_idx],
                alpha=alpha, linewidth=1.5)
        if sample_idx == 0:
            ax.set_ylabel(labels[output_idx])
        ax.grid(True, alpha=0.3)

axes[2].set_xlabel('x')
fig.suptitle('Multi-Output GP Samples (Correlated Outputs)', fontsize=14)
plt.tight_layout()
plt.show()

print("Notice how outputs 1 and 2 tend to move together (positive correlation)")
print("While outputs 1 and 3 tend to move oppositely (negative correlation)")
```

Sometimes GP sampling goes wrong. Here are common failure modes and how to diagnose them:
1. Cholesky Decomposition Fails
Symptom: `LinAlgError: Matrix is not positive definite`
Causes: duplicate or nearly duplicate input points, missing jitter, or a kernel function that is not actually positive semi-definite (often an implementation bug).
Fixes: increase the diagonal jitter (e.g., `K + 1e-4 * I`), remove duplicate input points, and double-check the kernel implementation.
2. Samples Look 'Wrong' (All Flat, Exploding, etc.)
Symptom: Samples are constant, or grow unboundedly, or have artifacts.
Causes: hyperparameters far from sensible values (a tiny signal variance gives near-flat samples; an unbounded kernel such as the linear kernel with a large variance gives exploding samples), a buggy kernel implementation, or a visualization grid that is too coarse relative to the length scale.
Fixes:
- Check eigenvalues: `np.linalg.eigvalsh(K)` should all be non-negative; a large spread indicates ill-conditioning.
- Check for NaNs and Infs: `np.all(np.isfinite(K))`.
- Visualize the kernel matrix: `plt.imshow(K)` reveals structure. It should be smooth and symmetric; artifacts indicate bugs.

If your grid spacing is larger than the length scale, samples will appear as uncorrelated random noise—you're effectively only seeing independent draws. Always ensure grid resolution is much finer than ℓ. A good rule: at least 10-20 grid points per length scale unit.
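The diagnostics above can be bundled into a small helper (a hypothetical convenience function, not part of the lesson code):

```python
import numpy as np

def diagnose_kernel_matrix(K, name="K"):
    """Print quick health checks for a kernel matrix before attempting Cholesky."""
    eigvals = np.linalg.eigvalsh(K)
    print(f"{name}: finite entries?     ", bool(np.all(np.isfinite(K))))
    print(f"{name}: symmetric?          ", bool(np.allclose(K, K.T)))
    print(f"{name}: min/max eigenvalue: ", eigvals.min(), eigvals.max())
    print(f"{name}: condition estimate: ", eigvals.max() / max(abs(eigvals.min()), 1e-300))

# Example on a well-behaved RBF kernel matrix
X = np.linspace(0, 10, 50)
K = np.exp(-0.5 * np.subtract.outer(X, X)**2) + 1e-8 * np.eye(len(X))
diagnose_kernel_matrix(K)
```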
So far we've focused on sampling from the prior—the GP before seeing data. The real power of GPs comes from the posterior, which incorporates observed data.
The Posterior GP: Given training data $(\mathbf{X}, \mathbf{y})$, the posterior is still a GP with updated mean and covariance:
$$\mathbf{f}_* \mid \mathbf{X}, \mathbf{y}, \mathbf{X}_* \sim \mathcal{N}(\boldsymbol{\mu}_*, \boldsymbol{\Sigma}_*)$$
where:
$$\boldsymbol{\mu}_* = \mathbf{K}_{*f} (\mathbf{K}_{ff} + \sigma_n^2 \mathbf{I})^{-1} \mathbf{y}$$
$$\boldsymbol{\Sigma}_* = \mathbf{K}_{**} - \mathbf{K}_{*f} (\mathbf{K}_{ff} + \sigma_n^2 \mathbf{I})^{-1} \mathbf{K}_{f*}$$
Sampling from the Posterior: Replace the prior mean and covariance with posterior versions:
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.linalg import cholesky, solve_triangular


def rbf_kernel(X1, X2, ls=1.0, var=1.0):
    if X1.ndim == 1:
        X1 = X1.reshape(-1, 1)
    if X2.ndim == 1:
        X2 = X2.reshape(-1, 1)
    sqdist = np.sum(X1**2, 1).reshape(-1, 1) + np.sum(X2**2, 1) - 2 * X1 @ X2.T
    return var * np.exp(-0.5 * sqdist / ls**2)


def gp_posterior(X_train, y_train, X_test, kernel_fn, noise_var=0.1):
    """Compute GP posterior mean and covariance."""
    K = kernel_fn(X_train, X_train) + noise_var * np.eye(len(X_train))
    K_s = kernel_fn(X_test, X_train)
    K_ss = kernel_fn(X_test, X_test)

    L = cholesky(K, lower=True)
    alpha = solve_triangular(L.T, solve_triangular(L, y_train, lower=True))
    v = solve_triangular(L, K_s.T, lower=True)

    mu = K_s @ alpha
    cov = K_ss - v.T @ v
    return mu, cov


def sample_posterior(mu, cov, n_samples=5, seed=42):
    """Sample from the posterior GP."""
    np.random.seed(seed)
    L = cholesky(cov + 1e-8 * np.eye(len(mu)), lower=True)
    samples = np.array([mu + L @ np.random.randn(len(mu)) for _ in range(n_samples)])
    return samples


# Setup
kernel_fn = lambda X1, X2: rbf_kernel(X1, X2, ls=1.0, var=1.0)
X_test = np.linspace(0, 10, 200)

# Training data
X_train = np.array([1.0, 3.0, 5.0, 7.0, 9.0])
y_train = np.array([0.5, -0.8, 0.3, 0.9, -0.2])

# Prior samples
K_prior = kernel_fn(X_test, X_test) + 1e-8 * np.eye(len(X_test))
L_prior = cholesky(K_prior, lower=True)
np.random.seed(42)
prior_samples = [L_prior @ np.random.randn(len(X_test)) for _ in range(5)]

# Posterior
mu_post, cov_post = gp_posterior(X_train, y_train, X_test, kernel_fn, noise_var=0.1)
post_samples = sample_posterior(mu_post, cov_post, n_samples=5)
post_std = np.sqrt(np.diag(cov_post))

# Visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Prior
ax = axes[0]
for s in prior_samples:
    ax.plot(X_test, s, alpha=0.7)
ax.fill_between(X_test, -2, 2, alpha=0.1, color='gray', label='±2σ region')
ax.axhline(0, color='k', linestyle='--', alpha=0.3)
ax.set_xlabel('x')
ax.set_ylabel('f(x)')
ax.set_title('Prior: Before Seeing Data')
ax.set_ylim([-3.5, 3.5])
ax.grid(True, alpha=0.3)

# Posterior
ax = axes[1]
for s in post_samples:
    ax.plot(X_test, s, alpha=0.7)
ax.fill_between(X_test, mu_post - 2*post_std, mu_post + 2*post_std,
                alpha=0.2, color='blue', label='±2σ region')
ax.plot(X_test, mu_post, 'k-', linewidth=2, label='Posterior mean')
ax.scatter(X_train, y_train, c='red', s=80, zorder=5,
           edgecolors='black', label='Observations')
ax.set_xlabel('x')
ax.set_ylabel('f(x)')
ax.set_title('Posterior: After Observing 5 Points')
ax.set_ylim([-3.5, 3.5])
ax.legend(loc='upper right')
ax.grid(True, alpha=0.3)

plt.suptitle('Prior to Posterior: How Observations Constrain Function Space', fontsize=14)
plt.tight_layout()
plt.show()

print("Prior samples span all plausible functions.")
print("Posterior samples are constrained to pass near observations.")
```

Posterior samples represent 'functions that could have generated our data.' They're essential for: (1) visualizing uncertainty, (2) propagating uncertainty through downstream computations, (3) Monte Carlo estimation of expectations, and (4) Bayesian optimization. The posterior is where GPs truly shine—principled, data-driven uncertainty over functions.
We've now covered all aspects of sampling from Gaussian Process priors—from the basic algorithm to complex composed kernels and posterior sampling.
Congratulations! You've completed Module 1: GP Fundamentals. You now understand both the function space and weight space perspectives on GPs, the formal GP definition, how mean and covariance functions specify the prior, and how to sample and visualize GP priors. The next module covers GP Regression—applying these fundamentals to learn from data.
You've mastered sampling from Gaussian Process priors. This skill underlies all GP visualization, prior predictive checking, and posterior analysis. You're now prepared to apply GP fundamentals to regression problems in the next module.