Imagine you're asked to predict the price of a house given its square footage. Traditional approaches would have you choose a specific function—perhaps a line, a polynomial, or a neural network—and then fit parameters to your data. But what if, instead of committing to a single function, you could reason about all possible functions at once?
This is precisely what Gaussian Processes (GPs) offer: a principled probabilistic framework for defining distributions over the space of functions. Rather than learning point estimates of parameters, GPs maintain complete probability distributions that express our uncertainty about which function generated our data.
This shift in perspective—from "find the best function" to "maintain beliefs about all functions"—is subtle but profound. It transforms regression from an optimization problem into an inference problem, bringing the full power of Bayesian reasoning to bear on function estimation.
By the end of this page, you will understand what it means for a Gaussian Process to define a distribution over functions, how this differs fundamentally from parametric regression approaches, and why this perspective enables principled uncertainty quantification. You will develop the mathematical intuition necessary to work with infinite-dimensional probability distributions.
Before we can appreciate Gaussian Processes, we must understand the paradigm they transcend. In parametric regression, we specify a functional form with a finite number of parameters. Consider the familiar linear model:
$$f(x) = w_0 + w_1 x$$
Here, we've committed to a specific hypothesis class—the set of all linear functions—and our task is to find the optimal parameters $(w_0, w_1)$. This approach has served us well, but it carries a fundamental limitation.
The deeper problem:
The parametric approach forces us to compress our beliefs about an infinite-dimensional object (a function) into a finite-dimensional parameter vector. This compression necessarily loses information. Even with Bayesian treatment where we maintain distributions over parameters, we cannot express beliefs about functions outside our chosen hypothesis class.
Consider trying to fit the function $f(x) = \sin(x)$ with a third-degree polynomial. No matter how much data we collect, our model cannot represent the true function—the polynomial hypothesis class simply doesn't contain sinusoids. We've introduced an inductive bias that no data can overcome.
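This irreducible bias is easy to check numerically. A minimal sketch (the interval, the sample counts, and the `cubic_fit_rmse` helper are our own illustrative choices): no matter how many noise-free samples of $\sin(x)$ we fit, the cubic's error plateaus at the approximation error of the hypothesis class rather than shrinking toward zero.

```python
import numpy as np

def cubic_fit_rmse(n_points):
    """Fit a degree-3 polynomial to noise-free sin(x) samples and
    return the root-mean-square error on a dense held-out grid."""
    x = np.linspace(-2 * np.pi, 2 * np.pi, n_points)
    coeffs = np.polyfit(x, np.sin(x), deg=3)
    x_test = np.linspace(-2 * np.pi, 2 * np.pi, 2000)
    resid = np.sin(x_test) - np.polyval(coeffs, x_test)
    return np.sqrt(np.mean(resid ** 2))

# More data does not help: the error saturates, because the
# polynomial class simply cannot represent a sinusoid.
for n in [50, 500, 5000]:
    print(f"n = {n:5d}  RMSE = {cubic_fit_rmse(n):.4f}")
```

The RMSE is essentially identical across sample sizes: the model has converged to the best cubic, which is still far from $\sin(x)$.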
What if we could avoid choosing a functional form entirely?
A nonparametric approach doesn't mean 'no parameters'—it means the number of effective parameters grows with the data. Gaussian Processes are the canonical example: they define distributions over functions directly, sidestepping the model specification problem while providing principled uncertainty quantification.
The conceptual leap that defines Gaussian Processes is elegantly simple: treat the unknown function itself as a random variable. Just as we can define probability distributions over numbers (like the Gaussian distribution over real numbers) or vectors (like the multivariate Gaussian), we can define probability distributions over functions.
But what does this mean, precisely? A function $f: \mathcal{X} \rightarrow \mathbb{R}$ maps inputs to outputs. To treat $f$ as a random variable, we imagine there's a probability distribution $p(f)$ from which functions are "drawn." Some functions have high probability (they're more likely to be the true function); others have low probability.
The key insight: Although functions are infinite-dimensional objects (they assign a value to every point in their domain), we only ever query them at finitely many points. If we care about the function values $f(x_1), f(x_2), \ldots, f(x_n)$ at specific inputs, we need only specify the joint distribution over these finite collections.
Here lies the genius of Gaussian Processes: we define a distribution over functions by specifying how all finite-dimensional marginals behave. If we can say how function values at any finite collection of points are jointly distributed, we have implicitly defined a distribution over functions.
Formal definition:
A Gaussian Process is a collection of random variables, any finite number of which have a joint Gaussian distribution. Mathematically, $f \sim \mathcal{GP}(m, k)$ means that for any finite set of inputs $\{x_1, x_2, \ldots, x_n\}$, the corresponding function values are jointly Gaussian:
$$\begin{pmatrix} f(x_1) \\ f(x_2) \\ \vdots \\ f(x_n) \end{pmatrix} \sim \mathcal{N}\left( \begin{pmatrix} m(x_1) \\ m(x_2) \\ \vdots \\ m(x_n) \end{pmatrix}, \begin{pmatrix} k(x_1, x_1) & k(x_1, x_2) & \cdots & k(x_1, x_n) \\ k(x_2, x_1) & k(x_2, x_2) & \cdots & k(x_2, x_n) \\ \vdots & \vdots & \ddots & \vdots \\ k(x_n, x_1) & k(x_n, x_2) & \cdots & k(x_n, x_n) \end{pmatrix} \right)$$
Here:
| Component | Notation | Role | Interpretation |
|---|---|---|---|
| Mean Function | $m(x)$ | Specifies expected function value | Prior belief about "typical" function behavior |
| Covariance Function | $k(x, x')$ | Specifies correlation structure | How function values at nearby inputs relate |
| Function Values | $f(x_1), \ldots, f(x_n)$ | Random variables we infer | The outputs we predict at query points |
| Observations | $y_1, \ldots, y_n$ | Noisy measurements | Actual data we condition on |
The kernel function $k(x, x')$ is the heart of a Gaussian Process. While the mean function captures our prior belief about "where" functions tend to be, the kernel captures our beliefs about the structure and smoothness of functions.
Intuition: The kernel $k(x, x')$ tells us how "similar" the function values at inputs $x$ and $x'$ should be. If $x$ and $x'$ are nearby (however we define "nearby" in input space), and we believe functions tend to be smooth, then $f(x)$ and $f(x')$ should be highly correlated.
Consider the most common kernel—the Squared Exponential (or RBF) kernel:
$$k(x, x') = \sigma^2 \exp\left( -\frac{|x - x'|^2}{2\ell^2} \right)$$
This kernel has two hyperparameters that shape its samples: the signal variance $\sigma^2$ controls the typical magnitude of function values, and the length scale $\ell$ controls how quickly functions vary.
```python
import numpy as np
import matplotlib.pyplot as plt

def squared_exponential_kernel(x1, x2, sigma=1.0, length_scale=1.0):
    """
    Squared Exponential (RBF) kernel.

    Parameters:
    - x1, x2: Input points (can be arrays for vectorized computation)
    - sigma: Signal variance (controls function magnitude)
    - length_scale: Controls smoothness (larger = smoother)

    Returns:
    - Covariance between f(x1) and f(x2)
    """
    sqdist = np.sum((x1 - x2) ** 2)
    return sigma ** 2 * np.exp(-sqdist / (2 * length_scale ** 2))

def compute_kernel_matrix(X, sigma=1.0, length_scale=1.0):
    """
    Compute the full kernel matrix K where K[i,j] = k(x_i, x_j).

    This matrix captures all pairwise correlations between function
    values at the input points. It's the covariance matrix of the
    multivariate Gaussian over function values.
    """
    n = len(X)
    K = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = squared_exponential_kernel(X[i], X[j], sigma, length_scale)
    return K

# Visualize how length scale affects function smoothness
np.random.seed(42)
X = np.linspace(-5, 5, 100).reshape(-1, 1)

fig, axes = plt.subplots(1, 3, figsize=(15, 4))
length_scales = [0.5, 1.0, 3.0]

for ax, ls in zip(axes, length_scales):
    # Compute kernel matrix with this length scale
    K = compute_kernel_matrix(X, sigma=1.0, length_scale=ls)
    # Add small diagonal for numerical stability
    K += 1e-8 * np.eye(len(X))

    # Sample 5 functions from the GP prior
    samples = np.random.multivariate_normal(
        mean=np.zeros(len(X)), cov=K, size=5
    )
    for sample in samples:
        ax.plot(X, sample, alpha=0.7)

    ax.set_xlim(-5, 5)
    ax.set_ylim(-3, 3)
    ax.set_title(f'Length Scale = {ls}')
    ax.set_xlabel('x')
    ax.set_ylabel('f(x)')
    ax.grid(True, alpha=0.3)

plt.suptitle('GP Prior Samples with Different Length Scales', fontsize=14)
plt.tight_layout()
plt.show()
```

The kernel encodes our inductive bias:
Unlike parametric models where we choose a functional form (linear, polynomial, etc.), GPs encode assumptions through the kernel. The kernel answers questions like: How smooth should functions be? Over what distance should function values remain correlated? How large should typical variations be?
This is a fundamentally different way to inject domain knowledge. Rather than saying "the function is a polynomial of degree 3," we say "the function should be smooth, with correlation decaying over a characteristic length scale." The former restricts us to a 4-dimensional family; the latter allows infinite flexibility while still incorporating prior knowledge.
To build intuition about GP priors, let's understand how to draw samples from them. Although a GP defines a distribution over functions (infinite-dimensional objects), we can always evaluate samples at a finite set of points.
The procedure:

1. Choose a finite set of input points $x_1, \ldots, x_n$.
2. Evaluate the mean vector $\mathbf{m} = (m(x_1), \ldots, m(x_n))$ and the kernel matrix $K$ with entries $K_{ij} = k(x_i, x_j)$.
3. Draw $\mathbf{f} \sim \mathcal{N}(\mathbf{m}, K)$.

The resulting vector $\mathbf{f}$ gives us function values at the chosen points. By plotting these values and connecting them, we visualize a sample function.
Computational note: Sampling from a multivariate Gaussian requires computing the Cholesky decomposition of $K$: $$\mathbf{f} = \mathbf{m} + L\mathbf{z}$$ where $K = LL^T$ and $\mathbf{z} \sim \mathcal{N}(0, I)$ is a vector of independent standard normals.
```python
import numpy as np
import matplotlib.pyplot as plt

def sample_gp_prior(X, mean_func=None, kernel_func=None,
                    n_samples=5, **kernel_params):
    """
    Sample functions from a Gaussian Process prior.

    Parameters:
    - X: Input points (n x d array)
    - mean_func: Mean function m(x), defaults to zero
    - kernel_func: Kernel function k(x, x'), defaults to SE kernel
    - n_samples: Number of function samples to draw
    - kernel_params: Parameters passed to kernel function

    Returns:
    - Sampled function values at X (n_samples x n array)
    """
    n = len(X)

    # Default to zero mean
    if mean_func is None:
        mean = np.zeros(n)
    else:
        mean = np.array([mean_func(x) for x in X])

    # Default to squared exponential kernel
    if kernel_func is None:
        sigma = kernel_params.get('sigma', 1.0)
        length_scale = kernel_params.get('length_scale', 1.0)
        # Build kernel matrix
        K = np.zeros((n, n))
        for i in range(n):
            for j in range(n):
                sqdist = np.sum((X[i] - X[j]) ** 2)
                K[i, j] = sigma**2 * np.exp(-sqdist / (2 * length_scale**2))
    else:
        K = np.array([[kernel_func(X[i], X[j], **kernel_params)
                       for j in range(n)] for i in range(n)])

    # Add jitter for numerical stability
    K += 1e-8 * np.eye(n)

    # Cholesky decomposition for efficient sampling
    L = np.linalg.cholesky(K)

    # Sample: f = mean + L @ z where z ~ N(0, I)
    Z = np.random.randn(n, n_samples)
    samples = mean[:, np.newaxis] + L @ Z

    return samples.T  # Shape: (n_samples, n)

# Demonstrate GP prior sampling
np.random.seed(42)

# Create dense evaluation points
X = np.linspace(-5, 5, 200).reshape(-1, 1)

# Sample from GP prior with different kernel settings
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

configs = [
    (1.0, 1.0, 'SE Kernel: σ=1.0, ℓ=1.0'),
    (1.0, 0.3, 'SE Kernel: σ=1.0, ℓ=0.3 (rougher)'),
    (2.0, 1.0, 'SE Kernel: σ=2.0, ℓ=1.0 (larger variance)'),
    (1.0, 3.0, 'SE Kernel: σ=1.0, ℓ=3.0 (smoother)'),
]

for ax, (sigma, ls, title) in zip(axes.ravel(), configs):
    samples = sample_gp_prior(X, sigma=sigma, length_scale=ls, n_samples=10)
    for sample in samples:
        ax.plot(X, sample, alpha=0.6)
    ax.set_title(title)
    ax.set_xlabel('x')
    ax.set_ylabel('f(x)')
    ax.grid(True, alpha=0.3)

plt.suptitle('Samples from GP Prior', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()
```

Each sample from a GP prior is a plausible function before seeing any data. The prior encodes our beliefs about what functions are 'reasonable.' Short length scales allow rapid variation; long length scales enforce smoothness. The signal variance controls the overall magnitude. By examining prior samples, we can validate whether our kernel choices match domain expectations.
Let's formalize the mathematical structure underlying Gaussian Processes. This foundation is essential for understanding posterior inference and making rigorous predictions.
Stochastic Process Definition:
A stochastic process is a collection of random variables indexed by some set. For a GP, this index set is the input domain $\mathcal{X}$ (often $\mathbb{R}^d$). We write $\{f(x) : x \in \mathcal{X}\}$ to denote this collection.
Gaussian Process Definition (Formal):
A stochastic process $\{f(x) : x \in \mathcal{X}\}$ is a Gaussian Process if, for every finite collection of indices $\{x_1, \ldots, x_n\} \subset \mathcal{X}$, the random vector $(f(x_1), \ldots, f(x_n))$ follows a multivariate Gaussian distribution.
This definition is remarkable in its simplicity: we only require that finite marginals be Gaussian. The Kolmogorov extension theorem guarantees that this is sufficient to define a valid probability distribution over functions.
For a GP to be well-defined, the kernel function must be positive semi-definite: for any finite set of inputs, the resulting kernel matrix must have non-negative eigenvalues. This ensures the covariance matrix is valid. Mercer's theorem, which we covered in Module 1, provides the theoretical foundation for valid kernel functions.
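The positive semi-definiteness requirement can be verified empirically for any particular kernel and input set by checking the eigenvalues of the resulting kernel matrix. A small sketch (the `se_kernel_matrix` helper and the random inputs are our own illustrative choices):

```python
import numpy as np

def se_kernel_matrix(X, sigma=1.0, length_scale=1.0):
    """Squared exponential kernel matrix for 1-D inputs (vectorized)."""
    X = np.asarray(X, dtype=float).reshape(-1, 1)
    sqdist = (X - X.T) ** 2
    return sigma**2 * np.exp(-sqdist / (2 * length_scale**2))

# Any finite set of inputs must yield a symmetric PSD covariance matrix.
rng = np.random.default_rng(0)
X = rng.uniform(-5, 5, size=30)
K = se_kernel_matrix(X)

eigvals = np.linalg.eigvalsh(K)
print(f"Smallest eigenvalue: {eigvals.min():.2e}")  # non-negative up to float error
```

A negative eigenvalue (beyond floating-point error) would mean the "kernel" is not a valid covariance function, and sampling or conditioning would fail.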
Properties of the Multivariate Gaussian:
To work with GPs, we need facility with multivariate Gaussian distributions. Key properties:
1. Marginalization: If $(\mathbf{f}_1, \mathbf{f}_2)$ is jointly Gaussian, then each marginal is also Gaussian:
$$\begin{pmatrix} \mathbf{f}_1 \\ \mathbf{f}_2 \end{pmatrix} \sim \mathcal{N}\left( \begin{pmatrix} \boldsymbol{\mu}_1 \\ \boldsymbol{\mu}_2 \end{pmatrix}, \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix} \right)$$
implies
$$\mathbf{f}_1 \sim \mathcal{N}(\boldsymbol{\mu}_1, \Sigma_{11})$$
2. Conditioning: The conditional distribution of $\mathbf{f}_1$ given $\mathbf{f}_2$ is also Gaussian:
$$\mathbf{f}_1 \mid \mathbf{f}_2 \sim \mathcal{N}(\boldsymbol{\mu}_{1|2}, \Sigma_{1|2})$$
where:

$$\boldsymbol{\mu}_{1|2} = \boldsymbol{\mu}_1 + \Sigma_{12} \Sigma_{22}^{-1} (\mathbf{f}_2 - \boldsymbol{\mu}_2)$$

$$\Sigma_{1|2} = \Sigma_{11} - \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21}$$
These formulas are the engine of GP inference. When we observe data, we condition the GP on those observations to obtain a posterior distribution over functions.
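The conditioning formulas translate directly into a few lines of linear algebra. A minimal sketch (the `gaussian_condition` and `se_kernel` helpers, the zero-mean prior, and the particular points and observed values are our own illustrative choices):

```python
import numpy as np

def gaussian_condition(mu1, mu2, S11, S12, S22, f2_obs):
    """Conditional mean and covariance of f1 given f2 = f2_obs,
    for jointly Gaussian (f1, f2). Uses solves, not explicit inverses."""
    mu_cond = mu1 + S12 @ np.linalg.solve(S22, f2_obs - mu2)
    S_cond = S11 - S12 @ np.linalg.solve(S22, S12.T)
    return mu_cond, S_cond

def se_kernel(A, B, sigma=1.0, ell=1.0):
    """SE kernel matrix between 1-D point sets A and B."""
    d2 = (A[:, None] - B[None, :]) ** 2
    return sigma**2 * np.exp(-d2 / (2 * ell**2))

# Query points f1 and observed points f2, under a zero-mean GP prior.
x1 = np.array([0.5, 1.5])          # where we want predictions
x2 = np.array([0.0, 1.0, 2.0])     # where we observed the function
S11 = se_kernel(x1, x1)
S12 = se_kernel(x1, x2)
S22 = se_kernel(x2, x2) + 1e-8 * np.eye(3)

mu_c, S_c = gaussian_condition(np.zeros(2), np.zeros(3),
                               S11, S12, S22,
                               f2_obs=np.array([0.2, -0.4, 0.1]))
print("Conditional mean:", mu_c)
print("Conditional variances:", np.diag(S_c))
```

Observing nearby function values shrinks the conditional variances below the prior variances on the diagonal of $\Sigma_{11}$, exactly the uncertainty reduction GP regression relies on.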
| Operation | Input | Output | Complexity |
|---|---|---|---|
| Marginalization | Joint distribution | Marginal distribution | Select submatrix: $O(1)$ |
| Conditioning | Joint + observed values | Conditional distribution | Matrix inversion: $O(n^3)$ |
| Sampling | Mean + Covariance | Sample vector | Cholesky: $O(n^3)$, then $O(n^2)$ per sample |
| Log-likelihood | Data + mean + covariance | Scalar | Cholesky + solve: $O(n^3)$ |
A natural question arises: how can we work with infinite-dimensional objects (functions) using finite-dimensional tools (linear algebra)? The answer lies in the consistency of GP marginals.
The consistency principle:
Suppose we have observed function values at points ${x_1, \ldots, x_n}$. If we later query the function at a new point $x_{n+1}$, the GP gives us a joint distribution over ${f(x_1), \ldots, f(x_{n+1})}$. The crucial property is that the marginal distribution over the original $n$ points is exactly what the GP specified before—adding new query points doesn't change predictions for old points.
This consistency means we can work with only the finite set of points we currently care about, confident that adding query points later will never invalidate earlier computations.
The function space perspective:
While we implement GPs using finite-dimensional linear algebra, it helps to visualize the underlying infinite-dimensional structure. Consider the space of all continuous functions $C(\mathcal{X})$ equipped with a suitable topology. A GP defines a probability measure on this space under which every finite-dimensional projection (every vector of function values at finitely many inputs) is Gaussian.
There's an alternative perspective connecting GPs to neural networks. A GP with kernel $k$ is equivalent to Bayesian linear regression in the feature space induced by $k$. As the number of features (or hidden units in a neural network) goes to infinity, many priors converge to GPs. This 'weight-space' view provides complementary intuition and connects GPs to deep learning theory.
```python
import numpy as np
import matplotlib.pyplot as plt

def demonstrate_gp_consistency():
    """
    Demonstrate that GP distributions are consistent: adding new
    query points doesn't change predictions at existing points.
    """
    np.random.seed(42)

    # Kernel function
    def se_kernel(x1, x2, sigma=1.0, length_scale=1.0):
        sqdist = np.sum((x1 - x2) ** 2)
        return sigma**2 * np.exp(-sqdist / (2 * length_scale**2))

    # Build kernel matrix
    def build_kernel_matrix(X1, X2=None, **params):
        if X2 is None:
            X2 = X1
        n1, n2 = len(X1), len(X2)
        K = np.zeros((n1, n2))
        for i in range(n1):
            for j in range(n2):
                K[i, j] = se_kernel(X1[i], X2[j], **params)
        return K

    # Original points
    X_original = np.array([[0.0], [1.0], [2.0]])
    K_original = build_kernel_matrix(X_original)

    # Extended points (add two new locations)
    X_extended = np.array([[0.0], [0.5], [1.0], [1.5], [2.0]])
    K_extended = build_kernel_matrix(X_extended)

    # Extract marginal corresponding to original points
    # Original points are at indices [0, 2, 4]
    original_indices = [0, 2, 4]
    K_marginal = K_extended[np.ix_(original_indices, original_indices)]

    # Verify consistency
    print("Original kernel matrix (3 points):")
    print(K_original)
    print("Marginal from extended matrix:")
    print(K_marginal)
    print("Difference (should be ~zero):")
    print(K_original - K_marginal)
    print(f"Max absolute difference: "
          f"{np.max(np.abs(K_original - K_marginal)):.2e}")

# Run demonstration
demonstrate_gp_consistency()

# Visualize how predictions remain consistent
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
np.random.seed(42)

# Sparse evaluation
X_sparse = np.linspace(-3, 3, 20).reshape(-1, 1)
K_sparse = np.zeros((len(X_sparse), len(X_sparse)))
for i in range(len(X_sparse)):
    for j in range(len(X_sparse)):
        sqdist = np.sum((X_sparse[i] - X_sparse[j]) ** 2)
        K_sparse[i, j] = np.exp(-sqdist / 2)
K_sparse += 1e-8 * np.eye(len(X_sparse))
L_sparse = np.linalg.cholesky(K_sparse)

# Sample at the sparse points
z = np.random.randn(len(X_sparse))
f_sparse = L_sparse @ z

axes[0].plot(X_sparse, f_sparse, 'o-', markersize=8)
axes[0].set_title('GP Sample at 20 Points')
axes[0].set_xlabel('x')
axes[0].set_ylabel('f(x)')
axes[0].grid(True, alpha=0.3)

# Dense evaluation - same function, more points.
# For the same sample, we condition on the sparse values;
# this demonstrates consistency - interpolating between known values.
X_dense = np.linspace(-3, 3, 200).reshape(-1, 1)
K_cross = np.zeros((len(X_dense), len(X_sparse)))
for i in range(len(X_dense)):
    for j in range(len(X_sparse)):
        sqdist = np.sum((X_dense[i] - X_sparse[j]) ** 2)
        K_cross[i, j] = np.exp(-sqdist / 2)

# Conditional mean (interpolation)
f_dense = K_cross @ np.linalg.solve(K_sparse, f_sparse)

axes[1].plot(X_dense, f_dense, 'b-', alpha=0.8, label='Interpolated (200 pts)')
axes[1].plot(X_sparse, f_sparse, 'ro', markersize=8, label='Original (20 pts)')
axes[1].set_title('Same Function at 200 Points (Consistent)')
axes[1].set_xlabel('x')
axes[1].set_ylabel('f(x)')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
```

Given that we want to define distributions over functions, why specifically Gaussian? Several compelling reasons make the Gaussian choice natural and practical:
1. Analytical Tractability:
Gaussian distributions are closed under both marginalization and conditioning. This means every distribution we derive along the way stays in the Gaussian family, with mean and covariance available in closed form.
For GPs, this closedness is essential. We need to marginalize (ignore some function values) and condition (observe some function values) constantly. Any other distribution family would require approximations.
2. Maximum Entropy:
For a fixed mean and covariance, the Gaussian distribution has maximum entropy. This means it's the "least informative" distribution consistent with our specified first and second moments. If we only want to encode beliefs about mean and covariance structure, Gaussian is the principled choice—we're not injecting additional assumptions.
3. Central Limit Theorem:
Under mild conditions, sums of many independent random variables converge to Gaussians. Since many real-world quantities arise as aggregates of many small effects, Gaussian models are often reasonable approximations to reality.
Gaussian distributions are symmetric and have thin tails. This isn't always appropriate—some phenomena exhibit heavy tails, skewness, or multimodality. If we need distributions over functions that can capture these features, we must move beyond standard GPs to Student-t processes, mixtures of GPs, or other generalizations. We'll discuss such extensions in advanced modules.
4. Computational Efficiency:
Working with Gaussians involves linear algebra operations: matrix multiplication, inversion, and Cholesky decomposition. These operations are well-understood, numerically stable, and heavily optimized in standard libraries such as BLAS and LAPACK.
Contrast this with distributions requiring numerical integration or sampling—Gaussians admit closed-form solutions.
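The log-likelihood entry in the complexity table below has a compact closed form. A hedged sketch (the `gaussian_log_likelihood` helper is our own; it uses the identity $\log|K| = 2\sum_i \log L_{ii}$ for the Cholesky factor $K = LL^T$):

```python
import numpy as np

def gaussian_log_likelihood(y, mean, K):
    """Log density of y under N(mean, K), computed via Cholesky
    for numerical stability: log|K| = 2 * sum(log diag(L))."""
    n = len(y)
    L = np.linalg.cholesky(K)
    # Solve K alpha = (y - mean) using the triangular factors
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y - mean))
    return (-0.5 * (y - mean) @ alpha
            - np.sum(np.log(np.diag(L)))
            - 0.5 * n * np.log(2 * np.pi))

# Sanity check in 1-D against the scalar Gaussian density.
val = gaussian_log_likelihood(np.array([0.7]), np.zeros(1), np.array([[2.0]]))
print(f"log N(0.7; 0, 2) = {val:.6f}")
```

The dominant cost is the Cholesky factorization, $O(n^3)$, matching the table; everything after it is $O(n^2)$ or cheaper.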
5. Complete Specification:
A Gaussian distribution is fully characterized by its first two moments—mean and covariance. Higher moments are determined by these. This means specifying a GP requires only a mean function $m(x)$ and a covariance function $k(x, x')$.
We don't need to specify third moments (skewness), fourth moments (kurtosis), or any higher-order structure. The kernel alone captures everything about the distribution's shape.
| Property | Statement | Consequence for GPs |
|---|---|---|
| Closure under marginalization | Marginals of Gaussians are Gaussian | Can query any subset of function values |
| Closure under conditioning | Conditionals of Gaussians are Gaussian | Posterior after observations is still a GP |
| Closure under linear transformations | Linear combinations are Gaussian | Derivatives, integrals of GPs are GPs |
| Maximum entropy | Least informative for fixed mean/cov | Principled when we only specify correlation structure |
| Two-parameter family | Fully specified by μ and Σ | Kernel encodes complete prior |
To develop intuition about GP priors, let's visualize the distribution over functions as confidence bands, not just individual samples.
At each input $x$, the GP prior gives us a marginal distribution for $f(x)$. With zero mean and a stationary kernel such as the squared exponential, for which $k(x, x) = \sigma^2$:
$$f(x) \sim \mathcal{N}(0, k(x, x)) = \mathcal{N}(0, \sigma^2)$$
This marginal tells us: before seeing any data, each function value is distributed around zero with variance $\sigma^2$. But this marginal alone doesn't capture the correlation structure—for that, we need to examine how function values at different inputs co-vary.
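This marginal claim can be verified by brute force: draw many prior functions and compare the empirical pointwise spread with $\sqrt{k(x,x)} = \sigma$. A small sketch (grid size, sample count, and hyperparameters are our own arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, length_scale = 1.5, 1.0

# SE kernel matrix on a modest grid, with jitter for stability
X = np.linspace(-3, 3, 50)
sqdist = (X[:, None] - X[None, :]) ** 2
K = sigma**2 * np.exp(-sqdist / (2 * length_scale**2)) + 1e-8 * np.eye(50)

# Draw many prior functions and look at the pointwise spread.
samples = rng.multivariate_normal(np.zeros(50), K, size=20000)
empirical_std = samples.std(axis=0)
print(f"sigma = {sigma:.2f}, mean empirical pointwise std = "
      f"{empirical_std.mean():.2f}")
```

The empirical standard deviation at each grid point matches $\sigma$ up to Monte Carlo noise, even though the samples are strongly correlated across nearby inputs.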
Confidence bands:
For a GP prior $f \sim \mathcal{GP}(0, k)$, we can plot the prior mean $m(x) = 0$ together with confidence bands at $\pm 2\sqrt{k(x, x)}$.
These bands show where 95% of function values fall at each individual input. They capture only the marginal (pointwise) uncertainty; the correlation between function values at different inputs is not visible in the bands themselves.
```python
import numpy as np
import matplotlib.pyplot as plt

def visualize_gp_prior(length_scale=1.0, sigma=1.0, n_samples=20):
    """
    Comprehensive visualization of a GP prior.
    Shows mean, confidence bands, and sample functions.
    """
    np.random.seed(42)

    # Dense evaluation grid
    X = np.linspace(-5, 5, 200).reshape(-1, 1)
    n = len(X)

    # Build kernel matrix
    K = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            sqdist = np.sum((X[i] - X[j]) ** 2)
            K[i, j] = sigma**2 * np.exp(-sqdist / (2 * length_scale**2))

    # Add jitter for numerical stability
    K += 1e-8 * np.eye(n)

    # Mean function (zero)
    mean = np.zeros(n)

    # Pointwise standard deviation
    std = np.sqrt(np.diag(K))

    # Sample functions
    L = np.linalg.cholesky(K)
    samples = mean[:, np.newaxis] + L @ np.random.randn(n, n_samples)

    # Create figure
    fig, axes = plt.subplots(2, 2, figsize=(14, 10))

    # Plot 1: Mean and confidence bands
    ax = axes[0, 0]
    ax.fill_between(X.ravel(), mean - 2*std, mean + 2*std,
                    alpha=0.2, color='C0', label='±2σ (95%)')
    ax.fill_between(X.ravel(), mean - std, mean + std,
                    alpha=0.3, color='C0', label='±1σ (68%)')
    ax.plot(X, mean, 'C0-', linewidth=2, label='Mean')
    ax.set_xlabel('x')
    ax.set_ylabel('f(x)')
    ax.set_title('GP Prior: Mean and Confidence Bands')
    ax.legend()
    ax.grid(True, alpha=0.3)
    ax.set_ylim(-4, 4)

    # Plot 2: Sample functions
    ax = axes[0, 1]
    for i in range(min(10, n_samples)):
        ax.plot(X, samples[:, i], alpha=0.7, linewidth=1.5)
    ax.plot(X, mean, 'k--', linewidth=2, label='Prior mean')
    ax.set_xlabel('x')
    ax.set_ylabel('f(x)')
    ax.set_title(f'10 Sample Functions (ℓ={length_scale})')
    ax.legend()
    ax.grid(True, alpha=0.3)
    ax.set_ylim(-4, 4)

    # Plot 3: Kernel function
    ax = axes[1, 0]
    x_ref = 0.0
    kernel_vals = [sigma**2 * np.exp(-(x - x_ref)**2 / (2 * length_scale**2))
                   for x in X.ravel()]
    ax.plot(X, kernel_vals, 'C1-', linewidth=2)
    ax.axhline(y=0, color='gray', linestyle='--', alpha=0.5)
    ax.axvline(x=x_ref, color='gray', linestyle='--', alpha=0.5)
    ax.fill_between(X.ravel(), 0, kernel_vals, alpha=0.3, color='C1')
    ax.set_xlabel("x'")
    ax.set_ylabel(f"k({x_ref}, x')")
    ax.set_title(f'Covariance Function (reference x={x_ref})')
    ax.grid(True, alpha=0.3)

    # Plot 4: Correlation at different separations
    ax = axes[1, 1]
    separations = np.linspace(0, 4, 100)
    correlations = np.exp(-separations**2 / (2 * length_scale**2))
    ax.plot(separations, correlations, 'C2-', linewidth=2)
    ax.axhline(y=np.exp(-0.5), color='gray', linestyle='--', alpha=0.5)
    ax.axvline(x=length_scale, color='gray', linestyle='--', alpha=0.5)
    ax.annotate(f'Length scale = {length_scale}',
                xy=(length_scale, np.exp(-0.5)),
                xytext=(length_scale + 0.5, 0.3),
                arrowprops=dict(arrowstyle='->', color='gray'),
                fontsize=10)
    ax.set_xlabel("|x - x'|")
    ax.set_ylabel('Correlation')
    ax.set_title('Correlation Decay with Distance')
    ax.grid(True, alpha=0.3)

    plt.suptitle(f'Gaussian Process Prior (σ={sigma}, ℓ={length_scale})',
                 fontsize=14, fontweight='bold')
    plt.tight_layout()
    plt.show()

# Visualize priors with different hyperparameters
visualize_gp_prior(length_scale=0.5, sigma=1.0)  # Short length scale
visualize_gp_prior(length_scale=2.0, sigma=1.0)  # Long length scale
```

Examining prior samples is crucial for model development. If prior samples look unreasonable—too wiggly, too smooth, wrong scale—the posterior will inherit these problems. Always visualize prior samples before fitting to data.
We've explored the fundamental concept underlying Gaussian Processes: treating functions as random variables drawn from probability distributions. This perspective shift enables principled uncertainty quantification and flexible nonparametric modeling.
What's Next:
With the foundation of GP priors established, we're ready to tackle the next crucial component: prior specification. We'll examine how to choose mean functions and kernels that encode appropriate domain knowledge, exploring the rich space of kernel designs and their implications for modeling.
You now understand the foundational concept of Gaussian Processes as distributions over functions. This perspective—treating functions as random variables—is the key insight that makes GPs powerful for Bayesian nonparametric regression. Next, we'll learn how to specify GP priors through careful choice of mean and kernel functions.