A Gaussian Process prior is an abstract mathematical object—a distribution over an infinite-dimensional function space. How can we understand what it actually represents? The answer is sampling.
By drawing samples from a GP prior, we generate concrete functions that are 'typical' according to our prior beliefs. These samples reveal the smoothness, amplitude, and characteristic length scale your prior encodes, properties that are otherwise hidden inside the kernel's formula.
The fundamental insight: Before fitting any data, you should sample from your prior and ask: 'Do these functions look like reasonable explanations for my data?' If not, your prior is misspecified, and it's time to revise your kernel choices.
By the end of this page, you will master the mathematical and computational methods for sampling from GP priors, understand how to use prior samples for model criticism and hyperparameter intuition, and develop a practical workflow for validating GP models before training.
Sampling from a GP at a finite set of points reduces to sampling from a multivariate Gaussian. Here's the complete procedure:
Given: GP with mean function $m(\mathbf{x})$ and kernel $k(\mathbf{x}, \mathbf{x}')$
Goal: Sample function values at points $\mathbf{X} = \{\mathbf{x}_1, \ldots, \mathbf{x}_n\}$
Algorithm:
1. Compute the mean vector: $\boldsymbol{\mu} = [m(\mathbf{x}_1), \ldots, m(\mathbf{x}_n)]^\top$
2. Compute the covariance matrix: $K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)$
3. Compute the Cholesky decomposition: $\mathbf{K} = \mathbf{L} \mathbf{L}^\top$ where $\mathbf{L}$ is lower-triangular
4. Sample a standard Gaussian vector: $\mathbf{z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$
5. Transform: $\mathbf{f} = \boldsymbol{\mu} + \mathbf{L} \mathbf{z}$
The resulting $\mathbf{f}$ is a sample from $\mathcal{N}(\boldsymbol{\mu}, \mathbf{K})$—exactly the GP evaluated at points $\mathbf{X}$.
Why Cholesky?
The Cholesky decomposition $\mathbf{K} = \mathbf{L} \mathbf{L}^\top$ provides a 'square root' of the covariance matrix. If $\mathbf{z}$ has identity covariance, then:
$$\text{Cov}[\mathbf{L} \mathbf{z}] = \mathbf{L} \text{Cov}[\mathbf{z}] \mathbf{L}^\top = \mathbf{L} \mathbf{I} \mathbf{L}^\top = \mathbf{K}$$
So $\mathbf{L} \mathbf{z}$ has exactly the covariance we need.
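As a quick sanity check, the snippet below (a minimal illustrative sketch; the grid and kernel are arbitrary choices, not part of the lesson code) estimates the covariance of $\mathbf{L}\mathbf{z}$ from many draws and compares it to $\mathbf{K}$:

```python
import numpy as np
from scipy.linalg import cholesky

rng = np.random.default_rng(0)

# Illustrative RBF kernel on a small grid
X = np.linspace(0, 5, 20)
K = np.exp(-0.5 * np.subtract.outer(X, X)**2) + 1e-8 * np.eye(len(X))
L = cholesky(K, lower=True)

# Draw many samples f = L z and estimate their covariance empirically
Z = rng.standard_normal((len(X), 100_000))
F = L @ Z
K_hat = np.cov(F)   # rows are grid points, columns are draws

print("Max deviation from K:", np.abs(K_hat - K).max())  # small; shrinks with more draws
```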
Numerical Considerations: In practice, add a small 'jitter' term (e.g., $10^{-8}$) to the diagonal of $\mathbf{K}$ before factorizing. Round-off error often makes kernel matrices numerically indefinite even though they are positive semi-definite in exact arithmetic, and the jitter keeps the Cholesky decomposition stable without visibly changing the samples.
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.linalg import cholesky


class GPPrior:
    """
    A Gaussian Process prior for sampling and visualization.
    """

    def __init__(self, mean_fn, kernel_fn, jitter=1e-8):
        """
        Args:
            mean_fn: m(X) -> array of mean values
            kernel_fn: k(X1, X2) -> covariance matrix
            jitter: small value added to diagonal for numerical stability
        """
        self.mean_fn = mean_fn
        self.kernel_fn = kernel_fn
        self.jitter = jitter

    def sample(self, X, n_samples=1, seed=None):
        """
        Sample function values from GP prior at points X.

        Args:
            X: (n,) or (n, d) array of input points
            n_samples: number of independent samples to draw
            seed: random seed for reproducibility

        Returns:
            (n_samples, n) array of function values
        """
        if seed is not None:
            np.random.seed(seed)

        n = len(X)

        # Step 1: Compute mean vector
        mu = self.mean_fn(X)

        # Step 2: Compute covariance matrix with jitter
        K = self.kernel_fn(X, X) + self.jitter * np.eye(n)

        # Step 3: Cholesky decomposition
        L = cholesky(K, lower=True)

        # Step 4 & 5: Sample and transform
        samples = []
        for _ in range(n_samples):
            z = np.random.randn(n)   # Standard Gaussian
            f = mu + L @ z           # Transform to GP sample
            samples.append(f)

        return np.array(samples)


# Define kernels
def zero_mean(X):
    return np.zeros(len(X))


def rbf_kernel(X1, X2, length_scale=1.0, variance=1.0):
    sqdist = np.subtract.outer(X1, X2)**2
    return variance * np.exp(-0.5 * sqdist / length_scale**2)


# Create GP prior
gp = GPPrior(
    mean_fn=zero_mean,
    kernel_fn=lambda X1, X2: rbf_kernel(X1, X2, length_scale=1.0, variance=1.0)
)

# Sample at many points for smooth visualization
X = np.linspace(0, 10, 200)
samples = gp.sample(X, n_samples=5, seed=42)

# Visualize
plt.figure(figsize=(12, 6))
for i, sample in enumerate(samples):
    plt.plot(X, sample, label=f'Sample {i+1}', alpha=0.8)

# Show mean and ±2σ region
mu = zero_mean(X)
K = rbf_kernel(X, X, length_scale=1.0, variance=1.0)
std = np.sqrt(np.diag(K))
plt.fill_between(X, mu - 2*std, mu + 2*std, alpha=0.15, color='gray',
                 label='±2σ region')
plt.plot(X, mu, 'k--', alpha=0.5, label='Prior mean')

plt.xlabel('x', fontsize=12)
plt.ylabel('f(x)', fontsize=12)
plt.title('Samples from GP Prior with RBF Kernel (ℓ=1.0)', fontsize=14)
plt.legend(loc='upper right')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print(f"Sampled {len(samples)} functions at {len(X)} points each.")
print("Each curve is one 'draw' from the GP prior distribution.")
```

Prior samples provide direct visual intuition for what your GP believes before seeing any data. Here's how to interpret them:
Sample Smoothness: Are the curves as smooth or as rough as you expect the true function to be? Smoothness is set by the kernel family (e.g., RBF versus Matérn).
Sample Amplitude: Do the curves swing over a range comparable to your data? Amplitude is governed by the signal variance $\sigma_f^2$.
Sample Length Scale: Do features rise and fall over distances that match your problem? This is governed by the length scale $\ell$.
This process of sampling from the prior and visually checking is called 'prior predictive checking.' It's a cornerstone of Bayesian workflow—validate that your assumptions are reasonable before conditioning on data. If samples look implausible, revise your kernel or hyperparameters. The data should narrow down an already-sensible prior, not rescue a bad one.
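As a minimal numerical companion to the visual check (a sketch only; `y_obs` and the kernel settings below are hypothetical stand-ins for your data and prior), you can compare the spread your prior assigns to function values against the spread of your observations:

```python
import numpy as np
from scipy.linalg import cholesky

rng = np.random.default_rng(1)

# Hypothetical observations you intend to model
y_obs = np.array([0.4, -1.1, 0.9, 2.3, -0.5])

# Prior samples at the observation locations (RBF kernel, unit signal variance)
X = np.linspace(0, 10, len(y_obs))
K = np.exp(-0.5 * np.subtract.outer(X, X)**2 / 2.0**2) + 1e-8 * np.eye(len(X))
L = cholesky(K, lower=True)
prior_draws = L @ rng.standard_normal((len(X), 1000))

# Crude prior predictive check: does the data range look typical under the prior?
print("Data range:  ", y_obs.min(), "to", y_obs.max())
print("Prior 2.5-97.5% interval roughly",
      np.percentile(prior_draws, 2.5), "to", np.percentile(prior_draws, 97.5))
```

If the data range falls far outside the prior's typical range, that is a sign the signal variance or length scale needs revisiting before any fitting.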
Each kernel hyperparameter affects samples in specific, predictable ways. Understanding these effects helps you set sensible initial values and interpret learned hyperparameters.
Signal Variance $\sigma_f^2$:
This scales the overall amplitude of variation. Mathematically, $k(\mathbf{x}, \mathbf{x}) = \sigma_f^2$, so each marginal $f(\mathbf{x})$ has standard deviation $\sigma_f$ and samples typically stay within roughly $\pm 2\sigma_f$ of the prior mean.
Length Scale $\ell$:
This controls the 'correlation distance'—roughly how far you can extrapolate before uncertainty increases. Small $\ell$ produces rapidly varying, wiggly samples; large $\ell$ produces slowly varying, nearly smooth samples.
Rule of thumb: set $\ell$ to the distance over which you expect significant variation in your function.
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.linalg import cholesky


def rbf_kernel(X1, X2, length_scale=1.0, variance=1.0):
    sqdist = np.subtract.outer(X1, X2)**2
    return variance * np.exp(-0.5 * sqdist / length_scale**2)


def sample_gp(X, kernel_fn, n_samples=3, seed=42):
    np.random.seed(seed)
    K = kernel_fn(X, X) + 1e-8 * np.eye(len(X))
    L = cholesky(K, lower=True)
    return np.array([L @ np.random.randn(len(X)) for _ in range(n_samples)])


X = np.linspace(0, 10, 200)

fig, axes = plt.subplots(2, 3, figsize=(15, 8))

# Vary length scale (top row)
length_scales = [0.3, 1.0, 3.0]
for idx, ls in enumerate(length_scales):
    ax = axes[0, idx]
    samples = sample_gp(X, lambda X1, X2: rbf_kernel(X1, X2, ls, 1.0))
    for s in samples:
        ax.plot(X, s, alpha=0.8)
    ax.set_title(f'Length Scale ℓ = {ls}', fontsize=12)
    ax.set_xlabel('x')
    ax.set_ylabel('f(x)')
    ax.set_ylim([-4, 4])
    ax.grid(True, alpha=0.3)

# Vary signal variance (bottom row)
variances = [0.25, 1.0, 4.0]
for idx, var in enumerate(variances):
    ax = axes[1, idx]
    samples = sample_gp(X, lambda X1, X2: rbf_kernel(X1, X2, 1.0, var))
    for s in samples:
        ax.plot(X, s, alpha=0.8)
    ax.set_title(f'Signal Variance σ² = {var}', fontsize=12)
    ax.set_xlabel('x')
    ax.set_ylabel('f(x)')
    ax.set_ylim([-6, 6])
    ax.grid(True, alpha=0.3)

    # Show ±2σ region
    std = np.sqrt(var)
    ax.axhline(y=2*std, color='r', linestyle='--', alpha=0.5)
    ax.axhline(y=-2*std, color='r', linestyle='--', alpha=0.5)

plt.suptitle('Effect of RBF Kernel Hyperparameters on Prior Samples', fontsize=14)
plt.tight_layout()
plt.show()

print("Top row: Length scale affects 'wiggliness' (smaller = more wiggly)")
print("Bottom row: Signal variance affects amplitude (red lines show ±2σ)")
```

| Hyperparameter | Effect When Increased | Effect When Decreased |
|---|---|---|
| Length scale $\ell$ | Smoother, slower variation | Rougher, faster variation |
| Signal variance $\sigma_f^2$ | Larger amplitude swings | Smaller amplitude swings |
| Matérn $\nu$ | Smoother samples | Rougher samples |
| Period $p$ | Longer repetition cycle | Shorter repetition cycle |
| ARD length scales | Dimension becomes irrelevant | Dimension becomes highly relevant |
When you combine kernels through addition or multiplication, the resulting samples reflect the combined properties. Visualizing these samples is crucial for understanding what complex kernels encode.
Addition ($k = k_1 + k_2$):
Samples from $k = k_1 + k_2$ are distributed as sums of independent samples from $k_1$ and $k_2$: if $f_1 \sim \mathcal{GP}(0, k_1)$ and $f_2 \sim \mathcal{GP}(0, k_2)$ are independent, then $f_1 + f_2 \sim \mathcal{GP}(0, k_1 + k_2)$.
Interpretation: Samples exhibit features from both components superimposed.
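A small empirical check of this fact (illustrative code, not part of the lesson's examples): add independent draws from $k_1$ and $k_2$ and confirm that their covariance matches $k_1 + k_2$.

```python
import numpy as np
from scipy.linalg import cholesky

rng = np.random.default_rng(2)
X = np.linspace(0, 10, 30)

k1 = np.exp(-0.5 * np.subtract.outer(X, X)**2 / 0.5**2)        # short length scale
k2 = 0.5 * np.exp(-0.5 * np.subtract.outer(X, X)**2 / 3.0**2)  # long length scale, smaller variance
jitter = 1e-8 * np.eye(len(X))

L1 = cholesky(k1 + jitter, lower=True)
L2 = cholesky(k2 + jitter, lower=True)

# f = f1 + f2 with f1 ~ GP(0, k1), f2 ~ GP(0, k2) independent
n_draws = 200_000
F = L1 @ rng.standard_normal((len(X), n_draws)) + L2 @ rng.standard_normal((len(X), n_draws))

# Should print True (up to Monte Carlo error)
print("Empirical cov close to k1 + k2:", np.allclose(np.cov(F), k1 + k2, atol=0.05))
```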
Multiplication ($k = k_1 \cdot k_2$):
Multiplication is more complex—a sample from $k_1 \cdot k_2$ is not the product of independent samples from $k_1$ and $k_2$. Instead, the product kernel only allows strong correlation where both components permit it, so one component effectively modulates the other.
Example: Periodic × RBF creates periodic functions that decay in amplitude.
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.linalg import cholesky


def rbf_kernel(X1, X2, ls=1.0, var=1.0):
    return var * np.exp(-0.5 * np.subtract.outer(X1, X2)**2 / ls**2)


def periodic_kernel(X1, X2, period=1.0, ls=1.0, var=1.0):
    diff = np.subtract.outer(X1, X2)
    return var * np.exp(-2 * np.sin(np.pi * np.abs(diff) / period)**2 / ls**2)


def linear_kernel(X1, X2, var=1.0, bias=1.0):
    return bias + var * np.outer(X1, X2)


def sample_gp(X, K_fn, n_samples=3, seed=42):
    np.random.seed(seed)
    K = K_fn(X, X) + 1e-8 * np.eye(len(X))
    L = cholesky(K, lower=True)
    return np.array([L @ np.random.randn(len(X)) for _ in range(n_samples)])


X = np.linspace(0, 15, 300)

fig, axes = plt.subplots(2, 3, figsize=(15, 8))

# Row 1: Component kernels
kernels_1 = [
    ("RBF (ℓ=2)", lambda X1, X2: rbf_kernel(X1, X2, ls=2.0)),
    ("Periodic (p=3)", lambda X1, X2: periodic_kernel(X1, X2, period=3.0, ls=0.5)),
    ("Linear", lambda X1, X2: linear_kernel(X1, X2, var=0.1, bias=0.1)),
]

for idx, (name, k_fn) in enumerate(kernels_1):
    samples = sample_gp(X, k_fn)
    ax = axes[0, idx]
    for s in samples:
        ax.plot(X, s, alpha=0.8)
    ax.set_title(name, fontsize=12)
    ax.set_xlabel('x')
    ax.set_ylabel('f(x)')
    ax.grid(True, alpha=0.3)

# Row 2: Composed kernels
kernels_2 = [
    ("RBF + Periodic",
     lambda X1, X2: rbf_kernel(X1, X2, ls=4.0)
                    + periodic_kernel(X1, X2, period=3.0, ls=0.5, var=0.5)),
    ("Periodic × RBF (decay)",
     lambda X1, X2: periodic_kernel(X1, X2, period=2.0, ls=0.5)
                    * rbf_kernel(X1, X2, ls=5.0)),
    ("Linear × Periodic",
     lambda X1, X2: linear_kernel(X1, X2, var=0.05, bias=0.01)
                    * periodic_kernel(X1, X2, period=2.0, ls=0.5)),
]

for idx, (name, k_fn) in enumerate(kernels_2):
    samples = sample_gp(X, k_fn)
    ax = axes[1, idx]
    for s in samples:
        ax.plot(X, s, alpha=0.8)
    ax.set_title(name, fontsize=12)
    ax.set_xlabel('x')
    ax.set_ylabel('f(x)')
    ax.grid(True, alpha=0.3)

plt.suptitle('Kernel Composition: Components (top) and Combinations (bottom)', fontsize=14)
plt.tight_layout()
plt.show()

print("Addition superposes components; multiplication creates interaction.")
```

Addition: 'My function is the sum of component A and component B.' Use for independent, decomposable structure (trend + seasonal + noise).
Multiplication: 'Component A modulates how component B behaves.' Use for interacting effects (decaying periodicity, varying variance, growing/shrinking patterns).
For scientific work and debugging, reproducible sampling is essential. Here's how to ensure consistent results.
Random Seed Setting:
```python
np.random.seed(42)  # Before any sampling
samples = gp.sample(X, n_samples=5)
```
With a fixed seed, the same samples are generated every time. This enables reproducible figures, easier debugging, and fair comparisons between kernels, since each kernel can be judged on the same underlying random draws.
Generating a Family of Diverse Samples:
```python
for trial in range(10):
    np.random.seed(trial)  # Different seed each trial
    samples = gp.sample(X)
    # Process results...
```
Thread Safety: In parallel code, be careful—multiple threads sharing a random state can produce non-reproducible results. Use thread-local random generators or explicitly pass random state objects.
Always set and record your random seed at the start of experiments. Include it in supplementary materials for papers. Many subtle bugs and 'irreproducible results' stem from forgotten or inconsistent random seeds. Modern libraries (NumPy ≥1.17, PyTorch) offer explicit random state management via Generator objects—prefer these over global state.
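For instance, a minimal sketch of the Generator-based style (the helper `sample_with_rng` is hypothetical, not part of the `GPPrior` class above):

```python
import numpy as np
from scipy.linalg import cholesky

def sample_with_rng(X, kernel_fn, rng, n_samples=3, jitter=1e-8):
    """Draw GP prior samples using an explicit Generator instead of global state."""
    K = kernel_fn(X, X) + jitter * np.eye(len(X))
    L = cholesky(K, lower=True)
    return L @ rng.standard_normal((len(X), n_samples))

rbf = lambda X1, X2: np.exp(-0.5 * np.subtract.outer(X1, X2)**2)
X = np.linspace(0, 10, 100)

samples_a = sample_with_rng(X, rbf, np.random.default_rng(42))
samples_b = sample_with_rng(X, rbf, np.random.default_rng(42))
print("Same seed, same draws:", np.allclose(samples_a, samples_b))  # True
```

Because each call owns its Generator, parallel workers can be given independent seeds without sharing global state.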
Sampling from a GP at $n$ points requires $O(n^3)$ time (for Cholesky) and $O(n^2)$ memory (for the kernel matrix). This limits the number of visualization points.
Practical Limits: On a typical workstation, a few hundred to a few thousand points is comfortable. Beyond roughly $10^4$ points, the $O(n^3)$ Cholesky factorization and the $O(n^2)$ kernel matrix (hundreds of megabytes in double precision) become noticeably expensive.
Strategies for Dense Visualization:
Subsample then interpolate: Sample at sparse points, use GP posterior mean to fill in (but this changes the character of samples)
Iterative refinement: Start with coarse grid, add points where samples vary rapidly
Approximate methods: for example, random Fourier features (RFF) replace the exact kernel with $m$ random basis functions, making dense grids cheap to sample (see the RFF row in the complexity table below and the sketch after this list)
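Here is a rough sketch of the random Fourier feature idea for a one-dimensional RBF kernel (the function and parameter names are illustrative, not from the lesson code): draw $m$ random frequencies from the kernel's spectral density, build an $n \times m$ feature matrix, and sample weights from a standard Gaussian.

```python
import numpy as np
import matplotlib.pyplot as plt

def rff_prior_sample(X, rng, length_scale=1.0, variance=1.0, n_features=500):
    """Approximate sample from an RBF-kernel GP prior via random Fourier features (1D inputs)."""
    omega = rng.normal(0.0, 1.0 / length_scale, size=n_features)   # spectral frequencies
    phase = rng.uniform(0.0, 2 * np.pi, size=n_features)           # random phases
    # Feature map: phi_i(x) = sqrt(2 * variance / m) * cos(omega_i * x + phase_i)
    Phi = np.sqrt(2.0 * variance / n_features) * np.cos(np.outer(X, omega) + phase)
    w = rng.standard_normal(n_features)                            # weight-space sample
    return Phi @ w                                                  # approximate f(X)

rng = np.random.default_rng(0)
X = np.linspace(0, 10, 20_000)   # far denser than an exact Cholesky would handle comfortably
f = rff_prior_sample(X, rng, length_scale=1.0, variance=1.0)

plt.plot(X, f, lw=1)
plt.title('Approximate GP prior sample via random Fourier features')
plt.xlabel('x')
plt.ylabel('f(x)')
plt.show()
```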
Efficient Sampling with Multiple Independent Samples:
Once you've computed the Cholesky decomposition $\mathbf{L}$, drawing additional samples is cheap—just $O(n^2)$ per sample:
```python
L = cholesky(K, lower=True)                              # O(n³) once
samples = [L @ np.random.randn(n) for _ in range(100)]   # O(n²) each
```
Caching: If you're exploring hyperparameters interactively, cache Cholesky factors for kernels you've already computed.
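One lightweight way to do this (a hypothetical sketch; it assumes the evaluation grid `X` is fixed, so only the hyperparameters enter the cache key):

```python
import numpy as np
from scipy.linalg import cholesky

_chol_cache = {}

def cached_cholesky(X, length_scale, variance, jitter=1e-8):
    """Return the Cholesky factor of an RBF kernel on X, reusing earlier computations."""
    key = (round(length_scale, 12), round(variance, 12), len(X))
    if key not in _chol_cache:
        K = variance * np.exp(-0.5 * np.subtract.outer(X, X)**2 / length_scale**2)
        _chol_cache[key] = cholesky(K + jitter * np.eye(len(X)), lower=True)
    return _chol_cache[key]

X = np.linspace(0, 10, 500)
L = cached_cholesky(X, 1.0, 1.0)   # computed once: O(n³)
L = cached_cholesky(X, 1.0, 1.0)   # reused from cache
samples = L @ np.random.default_rng(0).standard_normal((len(X), 10))  # cheap draws
```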
Numerical Stability: Near-singular kernel matrices (when points are very close together or the kernel is ill-conditioned) can cause Cholesky to fail. Solutions include adding a larger jitter to the diagonal, removing near-duplicate input points, or falling back to an eigendecomposition-based matrix square root; a retry-with-growing-jitter helper is sketched below.
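A common defensive pattern (a sketch, under the assumption that slightly inflating the diagonal is acceptable for your application) is to retry the factorization with progressively larger jitter:

```python
import numpy as np
from scipy.linalg import cholesky

def safe_cholesky(K, initial_jitter=1e-10, max_jitter=1e-2):
    """Attempt Cholesky with growing diagonal jitter until it succeeds."""
    jitter = initial_jitter
    while jitter <= max_jitter:
        try:
            return cholesky(K + jitter * np.eye(len(K)), lower=True)
        except np.linalg.LinAlgError:
            jitter *= 10   # escalate and retry
    raise np.linalg.LinAlgError("Cholesky failed even with maximum jitter")

# Illustrative use on a nearly singular kernel matrix (two almost identical points)
X = np.array([0.0, 1e-9, 5.0])
K = np.exp(-0.5 * np.subtract.outer(X, X)**2)
L = safe_cholesky(K)
```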
| Operation | Time Complexity | Memory |
|---|---|---|
| Kernel matrix construction | $O(n^2)$ | $O(n^2)$ |
| Cholesky decomposition | $O(n^3)$ | $O(n^2)$ |
| Single sample (given L) | $O(n^2)$ | $O(n)$ |
| M samples (given L) | $O(Mn^2)$ | $O(n^2 + Mn)$ |
| RFF approximation | $O(nm^2)$ | $O(nm)$ |
In many applications, we need to model multiple correlated outputs simultaneously. Multi-output GPs (MOGPs) extend the basic GP framework to handle this.
The Setup: We have $P$ output functions $f_1, \ldots, f_P$, each evaluated at potentially different input sets. The full output vector is:
$$\mathbf{f} = [f_1(\mathbf{X}_1), f_2(\mathbf{X}_2), \ldots, f_P(\mathbf{X}_P)]$$
The Multi-Output Kernel: A kernel for MOGPs specifies covariances between outputs at different inputs:
$$k((\mathbf{x}, p), (\mathbf{x}', p')) = k_{\text{input}}(\mathbf{x}, \mathbf{x}') \cdot k_{\text{output}}(p, p')$$
The simplest form is the Intrinsic Coregionalization Model (ICM): $$\mathbf{K}_{\text{MOGP}} = \mathbf{B} \otimes \mathbf{K}_{\text{input}}$$
where $\mathbf{B}$ is a $P \times P$ positive semi-definite matrix encoding output correlations, and $\otimes$ is the Kronecker product.
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.linalg import cholesky


def rbf_kernel(X1, X2, ls=1.0, var=1.0):
    return var * np.exp(-0.5 * np.subtract.outer(X1, X2)**2 / ls**2)


def sample_mogp(X, K_input, B, n_samples=3, seed=42):
    """
    Sample from Multi-Output GP using the ICM model.

    K_MOGP = B ⊗ K_input

    Args:
        X: input points (n,)
        K_input: input kernel matrix (n, n)
        B: output covariance matrix (P, P)

    Returns:
        (n_samples, P, n) array of function values
    """
    np.random.seed(seed)
    n = len(X)
    P = B.shape[0]

    # Build full covariance via Kronecker product
    K_full = np.kron(B, K_input) + 1e-8 * np.eye(P * n)
    L = cholesky(K_full, lower=True)

    samples = []
    for _ in range(n_samples):
        z = np.random.randn(P * n)
        f = L @ z
        f = f.reshape(P, n)   # Reshape to (outputs, inputs)
        samples.append(f)

    return np.array(samples)


# Define correlated outputs
X = np.linspace(0, 10, 100)
K_input = rbf_kernel(X, X, ls=2.0)

# Output correlation matrix (3 outputs)
# Positive correlations between 1-2, negative between 1-3
B = np.array([
    [1.0, 0.7, -0.5],
    [0.7, 1.0, 0.2],
    [-0.5, 0.2, 1.0]
])

# Sample
samples = sample_mogp(X, K_input, B, n_samples=3)

# Visualize
fig, axes = plt.subplots(3, 1, figsize=(12, 8), sharex=True)
colors = ['tab:blue', 'tab:orange', 'tab:green']
labels = ['Output 1', 'Output 2 (corr. with 1)', 'Output 3 (anti-corr. with 1)']

for sample_idx in range(3):
    sample = samples[sample_idx]
    for output_idx in range(3):
        ax = axes[output_idx]
        alpha = 0.9 if sample_idx == 0 else 0.5
        ax.plot(X, sample[output_idx], color=colors[output_idx],
                alpha=alpha, linewidth=1.5)
        if sample_idx == 0:
            ax.set_ylabel(labels[output_idx])
        ax.grid(True, alpha=0.3)

axes[2].set_xlabel('x')
fig.suptitle('Multi-Output GP Samples (Correlated Outputs)', fontsize=14)
plt.tight_layout()
plt.show()

print("Notice how outputs 1 and 2 tend to move together (positive correlation)")
print("While outputs 1 and 3 tend to move oppositely (negative correlation)")
```

Sometimes GP sampling goes wrong. Here are common failure modes and how to diagnose them:
1. Cholesky Decomposition Fails
Symptom: `LinAlgError: Matrix is not positive definite`
Causes: duplicate or nearly duplicate input points, missing jitter, or a kernel function that is not actually positive semi-definite (often an implementation bug).
Fixes: increase the diagonal jitter (e.g., `K + 1e-4 * I`), remove duplicate input points, and double-check the kernel implementation.
2. Samples Look 'Wrong' (All Flat, Exploding, etc.)
Symptom: Samples are constant, or grow unboundedly, or have artifacts.
Causes: hyperparameters far from sensible values (a tiny signal variance gives near-flat samples; an unbounded kernel such as the linear kernel with a large variance gives exploding samples), a buggy kernel implementation, or a visualization grid that is too coarse relative to the length scale.
Fixes:
- Check eigenvalues: `np.linalg.eigvalsh(K)` should all be non-negative; a large spread indicates ill-conditioning.
- Check for NaNs and Infs: `np.all(np.isfinite(K))`.
- Visualize the kernel matrix: `plt.imshow(K)` reveals structure. It should be smooth and symmetric; artifacts indicate bugs.

If your grid spacing is larger than the length scale, samples will appear as uncorrelated random noise—you're effectively only seeing independent draws. Always ensure grid resolution is much finer than ℓ. A good rule: at least 10-20 grid points per length scale unit.
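The diagnostics above can be bundled into a small helper (a hypothetical convenience function, not part of the lesson code):

```python
import numpy as np

def diagnose_kernel_matrix(K, name="K"):
    """Print quick health checks for a kernel matrix before attempting Cholesky."""
    eigvals = np.linalg.eigvalsh(K)
    print(f"{name}: finite entries?     ", bool(np.all(np.isfinite(K))))
    print(f"{name}: symmetric?          ", bool(np.allclose(K, K.T)))
    print(f"{name}: min/max eigenvalue: ", eigvals.min(), eigvals.max())
    print(f"{name}: condition estimate: ", eigvals.max() / max(abs(eigvals.min()), 1e-300))

# Example on a well-behaved RBF kernel matrix
X = np.linspace(0, 10, 50)
K = np.exp(-0.5 * np.subtract.outer(X, X)**2) + 1e-8 * np.eye(len(X))
diagnose_kernel_matrix(K)
```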
So far we've focused on sampling from the prior—the GP before seeing data. The real power of GPs comes from the posterior, which incorporates observed data.
The Posterior GP: Given training data $(\mathbf{X}, \mathbf{y})$, the posterior is still a GP with updated mean and covariance:
$$\mathbf{f}_* \mid \mathbf{X}, \mathbf{y}, \mathbf{X}_* \sim \mathcal{N}(\boldsymbol{\mu}_*, \boldsymbol{\Sigma}_*)$$
where:
$$\boldsymbol{\mu}_* = \mathbf{K}_{*f} (\mathbf{K}_{ff} + \sigma_n^2 \mathbf{I})^{-1} \mathbf{y}$$
$$\boldsymbol{\Sigma}_* = \mathbf{K}_{**} - \mathbf{K}_{*f} (\mathbf{K}_{ff} + \sigma_n^2 \mathbf{I})^{-1} \mathbf{K}_{f*}$$
Sampling from the Posterior: Replace the prior mean and covariance with posterior versions:
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.linalg import cholesky, solve_triangular


def rbf_kernel(X1, X2, ls=1.0, var=1.0):
    if X1.ndim == 1:
        X1 = X1.reshape(-1, 1)
    if X2.ndim == 1:
        X2 = X2.reshape(-1, 1)
    sqdist = np.sum(X1**2, 1).reshape(-1, 1) + np.sum(X2**2, 1) - 2 * X1 @ X2.T
    return var * np.exp(-0.5 * sqdist / ls**2)


def gp_posterior(X_train, y_train, X_test, kernel_fn, noise_var=0.1):
    """Compute GP posterior mean and covariance."""
    K = kernel_fn(X_train, X_train) + noise_var * np.eye(len(X_train))
    K_s = kernel_fn(X_test, X_train)
    K_ss = kernel_fn(X_test, X_test)

    L = cholesky(K, lower=True)
    alpha = solve_triangular(L.T, solve_triangular(L, y_train, lower=True))
    v = solve_triangular(L, K_s.T, lower=True)

    mu = K_s @ alpha
    cov = K_ss - v.T @ v
    return mu, cov


def sample_posterior(mu, cov, n_samples=5, seed=42):
    """Sample from the posterior GP."""
    np.random.seed(seed)
    L = cholesky(cov + 1e-8 * np.eye(len(mu)), lower=True)
    samples = np.array([mu + L @ np.random.randn(len(mu)) for _ in range(n_samples)])
    return samples


# Setup
kernel_fn = lambda X1, X2: rbf_kernel(X1, X2, ls=1.0, var=1.0)
X_test = np.linspace(0, 10, 200)

# Training data
X_train = np.array([1.0, 3.0, 5.0, 7.0, 9.0])
y_train = np.array([0.5, -0.8, 0.3, 0.9, -0.2])

# Prior samples
K_prior = kernel_fn(X_test, X_test) + 1e-8 * np.eye(len(X_test))
L_prior = cholesky(K_prior, lower=True)
np.random.seed(42)
prior_samples = [L_prior @ np.random.randn(len(X_test)) for _ in range(5)]

# Posterior
mu_post, cov_post = gp_posterior(X_train, y_train, X_test, kernel_fn, noise_var=0.1)
post_samples = sample_posterior(mu_post, cov_post, n_samples=5)
post_std = np.sqrt(np.diag(cov_post))

# Visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Prior
ax = axes[0]
for s in prior_samples:
    ax.plot(X_test, s, alpha=0.7)
ax.fill_between(X_test, -2, 2, alpha=0.1, color='gray', label='±2σ region')
ax.axhline(0, color='k', linestyle='--', alpha=0.3)
ax.set_xlabel('x')
ax.set_ylabel('f(x)')
ax.set_title('Prior: Before Seeing Data')
ax.set_ylim([-3.5, 3.5])
ax.grid(True, alpha=0.3)

# Posterior
ax = axes[1]
for s in post_samples:
    ax.plot(X_test, s, alpha=0.7)
ax.fill_between(X_test, mu_post - 2*post_std, mu_post + 2*post_std,
                alpha=0.2, color='blue', label='±2σ region')
ax.plot(X_test, mu_post, 'k-', linewidth=2, label='Posterior mean')
ax.scatter(X_train, y_train, c='red', s=80, zorder=5,
           edgecolors='black', label='Observations')
ax.set_xlabel('x')
ax.set_ylabel('f(x)')
ax.set_title('Posterior: After Observing 5 Points')
ax.set_ylim([-3.5, 3.5])
ax.legend(loc='upper right')
ax.grid(True, alpha=0.3)

plt.suptitle('Prior to Posterior: How Observations Constrain Function Space', fontsize=14)
plt.tight_layout()
plt.show()

print("Prior samples span all plausible functions.")
print("Posterior samples are constrained to pass near observations.")
```

Posterior samples represent 'functions that could have generated our data.' They're essential for: (1) visualizing uncertainty, (2) propagating uncertainty through downstream computations, (3) Monte Carlo estimation of expectations, and (4) Bayesian optimization. The posterior is where GPs truly shine—principled, data-driven uncertainty over functions.
We've now covered all aspects of sampling from Gaussian Process priors—from the basic algorithm to complex composed kernels and posterior sampling.
Congratulations! You've completed Module 1: GP Fundamentals. You now understand both the function space and weight space perspectives on GPs, the formal GP definition, how mean and covariance functions specify the prior, and how to sample and visualize GP priors. The next module covers GP Regression—applying these fundamentals to learn from data.
You've mastered sampling from Gaussian Process priors. This skill underlies all GP visualization, prior predictive checking, and posterior analysis. You're now prepared to apply GP fundamentals to regression problems in the next module.