Imagine you're asked to predict the price of a house given its square footage. Traditional approaches would have you choose a specific function—perhaps a line, a polynomial, or a neural network—and then fit parameters to your data. But what if, instead of committing to a single function, you could reason about all possible functions at once?
This is precisely what Gaussian Processes (GPs) offer: a principled probabilistic framework for defining distributions over the space of functions. Rather than learning point estimates of parameters, GPs maintain complete probability distributions that express our uncertainty about which function generated our data.
This shift in perspective—from "find the best function" to "maintain beliefs about all functions"—is subtle but profound. It transforms regression from an optimization problem into an inference problem, bringing the full power of Bayesian reasoning to bear on function estimation.
By the end of this page, you will understand what it means for a Gaussian Process to define a distribution over functions, how this differs fundamentally from parametric regression approaches, and why this perspective enables principled uncertainty quantification. You will develop the mathematical intuition necessary to work with infinite-dimensional probability distributions.
Before we can appreciate Gaussian Processes, we must understand the paradigm they transcend. In parametric regression, we specify a functional form with a finite number of parameters. Consider the familiar linear model:
$$f(x) = w_0 + w_1 x$$
Here, we've committed to a specific hypothesis class—the set of all linear functions—and our task is to find the optimal parameters $(w_0, w_1)$. This approach has served us well, but it carries a fundamental limitation.
The deeper problem:
The parametric approach forces us to compress our beliefs about an infinite-dimensional object (a function) into a finite-dimensional parameter vector. This compression necessarily loses information. Even with Bayesian treatment where we maintain distributions over parameters, we cannot express beliefs about functions outside our chosen hypothesis class.
Consider trying to fit the function $f(x) = \sin(x)$ with a third-degree polynomial. No matter how much data we collect, our model cannot represent the true function—the polynomial hypothesis class simply doesn't contain sinusoids. We've introduced an inductive bias that no data can overcome.
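This irreducible bias is easy to check numerically. A minimal sketch (the interval, the sample counts, and the `cubic_fit_rmse` helper are our own illustrative choices): no matter how many noise-free samples of $\sin(x)$ we fit, the cubic's error plateaus at the approximation error of the hypothesis class rather than shrinking toward zero.

```python
import numpy as np

def cubic_fit_rmse(n_points):
    """Fit a degree-3 polynomial to noise-free sin(x) samples and
    return the root-mean-square error on a dense held-out grid."""
    x = np.linspace(-2 * np.pi, 2 * np.pi, n_points)
    coeffs = np.polyfit(x, np.sin(x), deg=3)
    x_test = np.linspace(-2 * np.pi, 2 * np.pi, 2000)
    resid = np.sin(x_test) - np.polyval(coeffs, x_test)
    return np.sqrt(np.mean(resid ** 2))

# More data does not help: the error saturates, because the
# polynomial class simply cannot represent a sinusoid.
for n in [50, 500, 5000]:
    print(f"n = {n:5d}  RMSE = {cubic_fit_rmse(n):.4f}")
```

The RMSE is essentially identical across sample sizes: the model has converged to the best cubic, which is still far from $\sin(x)$.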
What if we could avoid choosing a functional form entirely?
A nonparametric approach doesn't mean 'no parameters'—it means the number of effective parameters grows with the data. Gaussian Processes are the canonical example: they define distributions over functions directly, sidestepping the model specification problem while providing principled uncertainty quantification.
The conceptual leap that defines Gaussian Processes is elegantly simple: treat the unknown function itself as a random variable. Just as we can define probability distributions over numbers (like the Gaussian distribution over real numbers) or vectors (like the multivariate Gaussian), we can define probability distributions over functions.
But what does this mean, precisely? A function $f: \mathcal{X} \rightarrow \mathbb{R}$ maps inputs to outputs. To treat $f$ as a random variable, we imagine there's a probability distribution $p(f)$ from which functions are "drawn." Some functions have high probability (they're more likely to be the true function); others have low probability.
The key insight: Although functions are infinite-dimensional objects (they assign a value to every point in their domain), we only ever query them at finitely many points. If we care about the function values $f(x_1), f(x_2), \ldots, f(x_n)$ at specific inputs, we need only specify the joint distribution over these finite collections.
Here lies the genius of Gaussian Processes: we define a distribution over functions by specifying how all finite-dimensional marginals behave. If we can say how function values at any finite collection of points are jointly distributed, we have implicitly defined a distribution over functions.
Formal definition:
A Gaussian Process is a collection of random variables, any finite number of which have a joint Gaussian distribution. Mathematically, $f \sim \mathcal{GP}(m, k)$ means that for any finite set of inputs $\{x_1, x_2, \ldots, x_n\}$, the corresponding function values are jointly Gaussian:
$$\begin{pmatrix} f(x_1) \\ f(x_2) \\ \vdots \\ f(x_n) \end{pmatrix} \sim \mathcal{N}\left( \begin{pmatrix} m(x_1) \\ m(x_2) \\ \vdots \\ m(x_n) \end{pmatrix}, \begin{pmatrix} k(x_1, x_1) & k(x_1, x_2) & \cdots & k(x_1, x_n) \\ k(x_2, x_1) & k(x_2, x_2) & \cdots & k(x_2, x_n) \\ \vdots & \vdots & \ddots & \vdots \\ k(x_n, x_1) & k(x_n, x_2) & \cdots & k(x_n, x_n) \end{pmatrix} \right)$$
Here:
| Component | Notation | Role | Interpretation |
|---|---|---|---|
| Mean Function | $m(x)$ | Specifies expected function value | Prior belief about "typical" function behavior |
| Covariance Function | $k(x, x')$ | Specifies correlation structure | How function values at nearby inputs relate |
| Function Values | $f(x_1), \ldots, f(x_n)$ | Random variables we infer | The outputs we predict at query points |
| Observations | $y_1, \ldots, y_n$ | Noisy measurements | Actual data we condition on |
The kernel function $k(x, x')$ is the heart of a Gaussian Process. While the mean function captures our prior belief about "where" functions tend to be, the kernel captures our beliefs about the structure and smoothness of functions.
Intuition: The kernel $k(x, x')$ tells us how "similar" the function values at inputs $x$ and $x'$ should be. If $x$ and $x'$ are nearby (however we define "nearby" in input space), and we believe functions tend to be smooth, then $f(x)$ and $f(x')$ should be highly correlated.
Consider the most common kernel—the Squared Exponential (or RBF) kernel:
$$k(x, x') = \sigma^2 \exp\left( -\frac{|x - x'|^2}{2\ell^2} \right)$$
This kernel has two hyperparameters that shape its samples: the signal variance $\sigma^2$ controls the typical magnitude of function values, and the length scale $\ell$ controls how quickly functions vary.
```python
import numpy as np
import matplotlib.pyplot as plt

def squared_exponential_kernel(x1, x2, sigma=1.0, length_scale=1.0):
    """
    Squared Exponential (RBF) kernel.

    Parameters:
    - x1, x2: Input points (can be arrays for vectorized computation)
    - sigma: Signal variance (controls function magnitude)
    - length_scale: Controls smoothness (larger = smoother)

    Returns:
    - Covariance between f(x1) and f(x2)
    """
    sqdist = np.sum((x1 - x2) ** 2)
    return sigma ** 2 * np.exp(-sqdist / (2 * length_scale ** 2))

def compute_kernel_matrix(X, sigma=1.0, length_scale=1.0):
    """
    Compute the full kernel matrix K where K[i,j] = k(x_i, x_j).

    This matrix captures all pairwise correlations between function
    values at the input points. It's the covariance matrix of the
    multivariate Gaussian over function values.
    """
    n = len(X)
    K = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = squared_exponential_kernel(X[i], X[j], sigma, length_scale)
    return K

# Visualize how length scale affects function smoothness
np.random.seed(42)
X = np.linspace(-5, 5, 100).reshape(-1, 1)

fig, axes = plt.subplots(1, 3, figsize=(15, 4))
length_scales = [0.5, 1.0, 3.0]

for ax, ls in zip(axes, length_scales):
    # Compute kernel matrix with this length scale
    K = compute_kernel_matrix(X, sigma=1.0, length_scale=ls)
    # Add small diagonal for numerical stability
    K += 1e-8 * np.eye(len(X))

    # Sample 5 functions from the GP prior
    samples = np.random.multivariate_normal(
        mean=np.zeros(len(X)), cov=K, size=5
    )
    for sample in samples:
        ax.plot(X, sample, alpha=0.7)

    ax.set_xlim(-5, 5)
    ax.set_ylim(-3, 3)
    ax.set_title(f'Length Scale = {ls}')
    ax.set_xlabel('x')
    ax.set_ylabel('f(x)')
    ax.grid(True, alpha=0.3)

plt.suptitle('GP Prior Samples with Different Length Scales', fontsize=14)
plt.tight_layout()
plt.show()
```

The kernel encodes our inductive bias:
Unlike parametric models where we choose a functional form (linear, polynomial, etc.), GPs encode assumptions through the kernel. The kernel answers questions like: How smooth should functions be? Over what distance should function values remain correlated? How large should typical variations be?
This is a fundamentally different way to inject domain knowledge. Rather than saying "the function is a polynomial of degree 3," we say "the function should be smooth, with correlation decaying over a characteristic length scale." The former restricts us to a 4-dimensional family; the latter allows infinite flexibility while still incorporating prior knowledge.
To build intuition about GP priors, let's understand how to draw samples from them. Although a GP defines a distribution over functions (infinite-dimensional objects), we can always evaluate samples at a finite set of points.
The procedure:

1. Choose a finite set of input points $x_1, \ldots, x_n$.
2. Evaluate the mean vector $\mathbf{m} = (m(x_1), \ldots, m(x_n))$ and the kernel matrix $K$ with entries $K_{ij} = k(x_i, x_j)$.
3. Draw $\mathbf{f} \sim \mathcal{N}(\mathbf{m}, K)$.

The resulting vector $\mathbf{f}$ gives us function values at the chosen points. By plotting these values and connecting them, we visualize a sample function.
Computational note: Sampling from a multivariate Gaussian requires computing the Cholesky decomposition of $K$: $$\mathbf{f} = \mathbf{m} + L\mathbf{z}$$ where $K = LL^T$ and $\mathbf{z} \sim \mathcal{N}(0, I)$ is a vector of independent standard normals.
```python
import numpy as np
import matplotlib.pyplot as plt

def sample_gp_prior(X, mean_func=None, kernel_func=None,
                    n_samples=5, **kernel_params):
    """
    Sample functions from a Gaussian Process prior.

    Parameters:
    - X: Input points (n x d array)
    - mean_func: Mean function m(x), defaults to zero
    - kernel_func: Kernel function k(x, x'), defaults to SE kernel
    - n_samples: Number of function samples to draw
    - kernel_params: Parameters passed to kernel function

    Returns:
    - Sampled function values at X (n_samples x n array)
    """
    n = len(X)

    # Default to zero mean
    if mean_func is None:
        mean = np.zeros(n)
    else:
        mean = np.array([mean_func(x) for x in X])

    # Default to squared exponential kernel
    if kernel_func is None:
        sigma = kernel_params.get('sigma', 1.0)
        length_scale = kernel_params.get('length_scale', 1.0)
        # Build kernel matrix
        K = np.zeros((n, n))
        for i in range(n):
            for j in range(n):
                sqdist = np.sum((X[i] - X[j]) ** 2)
                K[i, j] = sigma**2 * np.exp(-sqdist / (2 * length_scale**2))
    else:
        K = np.array([[kernel_func(X[i], X[j], **kernel_params)
                       for j in range(n)] for i in range(n)])

    # Add jitter for numerical stability
    K += 1e-8 * np.eye(n)

    # Cholesky decomposition for efficient sampling
    L = np.linalg.cholesky(K)

    # Sample: f = mean + L @ z where z ~ N(0, I)
    Z = np.random.randn(n, n_samples)
    samples = mean[:, np.newaxis] + L @ Z

    return samples.T  # Shape: (n_samples, n)

# Demonstrate GP prior sampling
np.random.seed(42)

# Create dense evaluation points
X = np.linspace(-5, 5, 200).reshape(-1, 1)

# Sample from GP prior with different kernel settings
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

configs = [
    (1.0, 1.0, 'SE Kernel: σ=1.0, ℓ=1.0'),
    (1.0, 0.3, 'SE Kernel: σ=1.0, ℓ=0.3 (rougher)'),
    (2.0, 1.0, 'SE Kernel: σ=2.0, ℓ=1.0 (larger variance)'),
    (1.0, 3.0, 'SE Kernel: σ=1.0, ℓ=3.0 (smoother)'),
]

for ax, (sigma, ls, title) in zip(axes.ravel(), configs):
    samples = sample_gp_prior(X, sigma=sigma, length_scale=ls, n_samples=10)
    for sample in samples:
        ax.plot(X, sample, alpha=0.6)
    ax.set_title(title)
    ax.set_xlabel('x')
    ax.set_ylabel('f(x)')
    ax.grid(True, alpha=0.3)

plt.suptitle('Samples from GP Prior', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()
```

Each sample from a GP prior is a plausible function before seeing any data. The prior encodes our beliefs about what functions are 'reasonable.' Short length scales allow rapid variation; long length scales enforce smoothness. The signal variance controls the overall magnitude. By examining prior samples, we can validate whether our kernel choices match domain expectations.
Let's formalize the mathematical structure underlying Gaussian Processes. This foundation is essential for understanding posterior inference and making rigorous predictions.
Stochastic Process Definition:
A stochastic process is a collection of random variables indexed by some set. For a GP, this index set is the input domain $\mathcal{X}$ (often $\mathbb{R}^d$). We write $\{f(x) : x \in \mathcal{X}\}$ to denote this collection.
Gaussian Process Definition (Formal):
A stochastic process $\{f(x) : x \in \mathcal{X}\}$ is a Gaussian Process if, for every finite collection of indices $\{x_1, \ldots, x_n\} \subset \mathcal{X}$, the random vector $(f(x_1), \ldots, f(x_n))$ follows a multivariate Gaussian distribution.
This definition is remarkable in its simplicity: we only require that finite marginals be Gaussian. The Kolmogorov extension theorem guarantees that this is sufficient to define a valid probability distribution over functions.
For a GP to be well-defined, the kernel function must be positive semi-definite: for any finite set of inputs, the resulting kernel matrix must have non-negative eigenvalues. This ensures the covariance matrix is valid. Mercer's theorem, which we covered in Module 1, provides the theoretical foundation for valid kernel functions.
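The positive semi-definiteness requirement can be verified empirically for any particular kernel and input set by checking the eigenvalues of the resulting kernel matrix. A small sketch (the `se_kernel_matrix` helper and the random inputs are our own illustrative choices):

```python
import numpy as np

def se_kernel_matrix(X, sigma=1.0, length_scale=1.0):
    """Squared exponential kernel matrix for 1-D inputs (vectorized)."""
    X = np.asarray(X, dtype=float).reshape(-1, 1)
    sqdist = (X - X.T) ** 2
    return sigma**2 * np.exp(-sqdist / (2 * length_scale**2))

# Any finite set of inputs must yield a symmetric PSD covariance matrix.
rng = np.random.default_rng(0)
X = rng.uniform(-5, 5, size=30)
K = se_kernel_matrix(X)

eigvals = np.linalg.eigvalsh(K)
print(f"Smallest eigenvalue: {eigvals.min():.2e}")  # non-negative up to float error
```

A negative eigenvalue (beyond floating-point error) would mean the "kernel" is not a valid covariance function, and sampling or conditioning would fail.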
Properties of the Multivariate Gaussian:
To work with GPs, we need facility with multivariate Gaussian distributions. Key properties:
1. Marginalization: If $(\mathbf{f}_1, \mathbf{f}_2)$ is jointly Gaussian, then each marginal is also Gaussian:
$$\begin{pmatrix} \mathbf{f}_1 \\ \mathbf{f}_2 \end{pmatrix} \sim \mathcal{N}\left( \begin{pmatrix} \boldsymbol{\mu}_1 \\ \boldsymbol{\mu}_2 \end{pmatrix}, \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix} \right)$$
implies
$$\mathbf{f}_1 \sim \mathcal{N}(\boldsymbol{\mu}_1, \Sigma_{11})$$
2. Conditioning: The conditional distribution of $\mathbf{f}_1$ given $\mathbf{f}_2$ is also Gaussian:
$$\mathbf{f}_1 \mid \mathbf{f}_2 \sim \mathcal{N}(\boldsymbol{\mu}_{1|2}, \Sigma_{1|2})$$
where:

$$\boldsymbol{\mu}_{1|2} = \boldsymbol{\mu}_1 + \Sigma_{12} \Sigma_{22}^{-1} (\mathbf{f}_2 - \boldsymbol{\mu}_2)$$

$$\Sigma_{1|2} = \Sigma_{11} - \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21}$$
These formulas are the engine of GP inference. When we observe data, we condition the GP on those observations to obtain a posterior distribution over functions.
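The conditioning formulas translate directly into a few lines of linear algebra. A minimal sketch (the `gaussian_condition` and `se_kernel` helpers, the zero-mean prior, and the particular points and observed values are our own illustrative choices):

```python
import numpy as np

def gaussian_condition(mu1, mu2, S11, S12, S22, f2_obs):
    """Conditional mean and covariance of f1 given f2 = f2_obs,
    for jointly Gaussian (f1, f2). Uses solves, not explicit inverses."""
    mu_cond = mu1 + S12 @ np.linalg.solve(S22, f2_obs - mu2)
    S_cond = S11 - S12 @ np.linalg.solve(S22, S12.T)
    return mu_cond, S_cond

def se_kernel(A, B, sigma=1.0, ell=1.0):
    """SE kernel matrix between 1-D point sets A and B."""
    d2 = (A[:, None] - B[None, :]) ** 2
    return sigma**2 * np.exp(-d2 / (2 * ell**2))

# Query points f1 and observed points f2, under a zero-mean GP prior.
x1 = np.array([0.5, 1.5])          # where we want predictions
x2 = np.array([0.0, 1.0, 2.0])     # where we observed the function
S11 = se_kernel(x1, x1)
S12 = se_kernel(x1, x2)
S22 = se_kernel(x2, x2) + 1e-8 * np.eye(3)

mu_c, S_c = gaussian_condition(np.zeros(2), np.zeros(3),
                               S11, S12, S22,
                               f2_obs=np.array([0.2, -0.4, 0.1]))
print("Conditional mean:", mu_c)
print("Conditional variances:", np.diag(S_c))
```

Observing nearby function values shrinks the conditional variances below the prior variances on the diagonal of $\Sigma_{11}$, exactly the uncertainty reduction GP regression relies on.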
| Operation | Input | Output | Complexity |
|---|---|---|---|
| Marginalization | Joint distribution | Marginal distribution | Select submatrix: $O(1)$ |
| Conditioning | Joint + observed values | Conditional distribution | Matrix inversion: $O(n^3)$ |
| Sampling | Mean + Covariance | Sample vector | Cholesky: $O(n^3)$, then $O(n^2)$ per sample |
| Log-likelihood | Data + mean + covariance | Scalar | Cholesky + solve: $O(n^3)$ |
A natural question arises: how can we work with infinite-dimensional objects (functions) using finite-dimensional tools (linear algebra)? The answer lies in the consistency of GP marginals.
The consistency principle:
Suppose we have observed function values at points ${x_1, \ldots, x_n}$. If we later query the function at a new point $x_{n+1}$, the GP gives us a joint distribution over ${f(x_1), \ldots, f(x_{n+1})}$. The crucial property is that the marginal distribution over the original $n$ points is exactly what the GP specified before—adding new query points doesn't change predictions for old points.
This consistency means we can work with only the finite set of points we currently care about, confident that adding query points later will never invalidate earlier computations.
The function space perspective:
While we implement GPs using finite-dimensional linear algebra, it helps to visualize the underlying infinite-dimensional structure. Consider the space of all continuous functions $C(\mathcal{X})$ equipped with a suitable topology. A GP defines a probability measure on this space under which every finite-dimensional projection (every vector of function values at finitely many inputs) is Gaussian.
There's an alternative perspective connecting GPs to neural networks. A GP with kernel $k$ is equivalent to Bayesian linear regression in the feature space induced by $k$. As the number of features (or hidden units in a neural network) goes to infinity, many priors converge to GPs. This 'weight-space' view provides complementary intuition and connects GPs to deep learning theory.
```python
import numpy as np
import matplotlib.pyplot as plt

def demonstrate_gp_consistency():
    """
    Demonstrate that GP distributions are consistent: adding new
    query points doesn't change predictions at existing points.
    """
    np.random.seed(42)

    # Kernel function
    def se_kernel(x1, x2, sigma=1.0, length_scale=1.0):
        sqdist = np.sum((x1 - x2) ** 2)
        return sigma**2 * np.exp(-sqdist / (2 * length_scale**2))

    # Build kernel matrix
    def build_kernel_matrix(X1, X2=None, **params):
        if X2 is None:
            X2 = X1
        n1, n2 = len(X1), len(X2)
        K = np.zeros((n1, n2))
        for i in range(n1):
            for j in range(n2):
                K[i, j] = se_kernel(X1[i], X2[j], **params)
        return K

    # Original points
    X_original = np.array([[0.0], [1.0], [2.0]])
    K_original = build_kernel_matrix(X_original)

    # Extended points (add two new locations)
    X_extended = np.array([[0.0], [0.5], [1.0], [1.5], [2.0]])
    K_extended = build_kernel_matrix(X_extended)

    # Extract marginal corresponding to original points
    # Original points are at indices [0, 2, 4]
    original_indices = [0, 2, 4]
    K_marginal = K_extended[np.ix_(original_indices, original_indices)]

    # Verify consistency
    print("Original kernel matrix (3 points):")
    print(K_original)
    print("Marginal from extended matrix:")
    print(K_marginal)
    print("Difference (should be ~zero):")
    print(K_original - K_marginal)
    print(f"Max absolute difference: "
          f"{np.max(np.abs(K_original - K_marginal)):.2e}")

# Run demonstration
demonstrate_gp_consistency()

# Visualize how predictions remain consistent
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
np.random.seed(42)

# Sparse evaluation
X_sparse = np.linspace(-3, 3, 20).reshape(-1, 1)
K_sparse = np.zeros((len(X_sparse), len(X_sparse)))
for i in range(len(X_sparse)):
    for j in range(len(X_sparse)):
        sqdist = np.sum((X_sparse[i] - X_sparse[j]) ** 2)
        K_sparse[i, j] = np.exp(-sqdist / 2)
K_sparse += 1e-8 * np.eye(len(X_sparse))
L_sparse = np.linalg.cholesky(K_sparse)

# Sample at the sparse points
z = np.random.randn(len(X_sparse))
f_sparse = L_sparse @ z

axes[0].plot(X_sparse, f_sparse, 'o-', markersize=8)
axes[0].set_title('GP Sample at 20 Points')
axes[0].set_xlabel('x')
axes[0].set_ylabel('f(x)')
axes[0].grid(True, alpha=0.3)

# Dense evaluation - same function, more points.
# For the same sample, we condition on the sparse values;
# this demonstrates consistency - interpolating between known values.
X_dense = np.linspace(-3, 3, 200).reshape(-1, 1)
K_cross = np.zeros((len(X_dense), len(X_sparse)))
for i in range(len(X_dense)):
    for j in range(len(X_sparse)):
        sqdist = np.sum((X_dense[i] - X_sparse[j]) ** 2)
        K_cross[i, j] = np.exp(-sqdist / 2)

# Conditional mean (interpolation)
f_dense = K_cross @ np.linalg.solve(K_sparse, f_sparse)

axes[1].plot(X_dense, f_dense, 'b-', alpha=0.8, label='Interpolated (200 pts)')
axes[1].plot(X_sparse, f_sparse, 'ro', markersize=8, label='Original (20 pts)')
axes[1].set_title('Same Function at 200 Points (Consistent)')
axes[1].set_xlabel('x')
axes[1].set_ylabel('f(x)')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
```

Given that we want to define distributions over functions, why specifically Gaussian? Several compelling reasons make the Gaussian choice natural and practical:
1. Analytical Tractability:
Gaussian distributions are closed under both marginalization and conditioning. This means every distribution we derive along the way stays in the Gaussian family, with mean and covariance available in closed form.
For GPs, this closedness is essential. We need to marginalize (ignore some function values) and condition (observe some function values) constantly. Any other distribution family would require approximations.
2. Maximum Entropy:
For a fixed mean and covariance, the Gaussian distribution has maximum entropy. This means it's the "least informative" distribution consistent with our specified first and second moments. If we only want to encode beliefs about mean and covariance structure, Gaussian is the principled choice—we're not injecting additional assumptions.
3. Central Limit Theorem:
Under mild conditions, sums of many independent random variables converge to Gaussians. Since many real-world quantities arise as aggregates of many small effects, Gaussian models are often reasonable approximations to reality.
Gaussian distributions are symmetric and have thin tails. This isn't always appropriate—some phenomena exhibit heavy tails, skewness, or multimodality. If we need distributions over functions that can capture these features, we must move beyond standard GPs to Student-t processes, mixtures of GPs, or other generalizations. We'll discuss such extensions in advanced modules.
4. Computational Efficiency:
Working with Gaussians involves linear algebra operations: matrix multiplication, inversion, and Cholesky decomposition. These operations are well-understood, numerically stable, and heavily optimized in standard libraries such as BLAS and LAPACK.
Contrast this with distributions requiring numerical integration or sampling—Gaussians admit closed-form solutions.
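The log-likelihood entry in the complexity table below has a compact closed form. A hedged sketch (the `gaussian_log_likelihood` helper is our own; it uses the identity $\log|K| = 2\sum_i \log L_{ii}$ for the Cholesky factor $K = LL^T$):

```python
import numpy as np

def gaussian_log_likelihood(y, mean, K):
    """Log density of y under N(mean, K), computed via Cholesky
    for numerical stability: log|K| = 2 * sum(log diag(L))."""
    n = len(y)
    L = np.linalg.cholesky(K)
    # Solve K alpha = (y - mean) using the triangular factors
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y - mean))
    return (-0.5 * (y - mean) @ alpha
            - np.sum(np.log(np.diag(L)))
            - 0.5 * n * np.log(2 * np.pi))

# Sanity check in 1-D against the scalar Gaussian density.
val = gaussian_log_likelihood(np.array([0.7]), np.zeros(1), np.array([[2.0]]))
print(f"log N(0.7; 0, 2) = {val:.6f}")
```

The dominant cost is the Cholesky factorization, $O(n^3)$, matching the table; everything after it is $O(n^2)$ or cheaper.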
5. Complete Specification:
A Gaussian distribution is fully characterized by its first two moments—mean and covariance. Higher moments are determined by these. This means specifying a GP requires only a mean function $m(x)$ and a covariance function $k(x, x')$.
We don't need to specify third moments (skewness), fourth moments (kurtosis), or any higher-order structure. The kernel alone captures everything about the distribution's shape.
| Property | Statement | Consequence for GPs |
|---|---|---|
| Closure under marginalization | Marginals of Gaussians are Gaussian | Can query any subset of function values |
| Closure under conditioning | Conditionals of Gaussians are Gaussian | Posterior after observations is still a GP |
| Closure under linear transformations | Linear combinations are Gaussian | Derivatives, integrals of GPs are GPs |
| Maximum entropy | Least informative for fixed mean/cov | Principled when we only specify correlation structure |
| Two-parameter family | Fully specified by μ and Σ | Kernel encodes complete prior |
To develop intuition about GP priors, let's visualize the distribution over functions as confidence bands, not just individual samples.
At each input $x$, the GP prior gives us a marginal distribution for $f(x)$. With zero mean and a stationary kernel such as the squared exponential, for which $k(x, x) = \sigma^2$:
$$f(x) \sim \mathcal{N}(0, k(x, x)) = \mathcal{N}(0, \sigma^2)$$
This marginal tells us: before seeing any data, each function value is distributed around zero with variance $\sigma^2$. But this marginal alone doesn't capture the correlation structure—for that, we need to examine how function values at different inputs co-vary.
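This marginal claim can be verified by brute force: draw many prior functions and compare the empirical pointwise spread with $\sqrt{k(x,x)} = \sigma$. A small sketch (grid size, sample count, and hyperparameters are our own arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, length_scale = 1.5, 1.0

# SE kernel matrix on a modest grid, with jitter for stability
X = np.linspace(-3, 3, 50)
sqdist = (X[:, None] - X[None, :]) ** 2
K = sigma**2 * np.exp(-sqdist / (2 * length_scale**2)) + 1e-8 * np.eye(50)

# Draw many prior functions and look at the pointwise spread.
samples = rng.multivariate_normal(np.zeros(50), K, size=20000)
empirical_std = samples.std(axis=0)
print(f"sigma = {sigma:.2f}, mean empirical pointwise std = "
      f"{empirical_std.mean():.2f}")
```

The empirical standard deviation at each grid point matches $\sigma$ up to Monte Carlo noise, even though the samples are strongly correlated across nearby inputs.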
Confidence bands:
For a GP prior $f \sim \mathcal{GP}(0, k)$, we can plot the prior mean $m(x) = 0$ together with confidence bands at $\pm 2\sqrt{k(x, x)}$.
These bands show where 95% of function values fall at each individual input. They capture only the marginal (pointwise) uncertainty; the correlation between function values at different inputs is not visible in the bands themselves.
```python
import numpy as np
import matplotlib.pyplot as plt

def visualize_gp_prior(length_scale=1.0, sigma=1.0, n_samples=20):
    """
    Comprehensive visualization of a GP prior.
    Shows mean, confidence bands, and sample functions.
    """
    np.random.seed(42)

    # Dense evaluation grid
    X = np.linspace(-5, 5, 200).reshape(-1, 1)
    n = len(X)

    # Build kernel matrix
    K = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            sqdist = np.sum((X[i] - X[j]) ** 2)
            K[i, j] = sigma**2 * np.exp(-sqdist / (2 * length_scale**2))

    # Add jitter for numerical stability
    K += 1e-8 * np.eye(n)

    # Mean function (zero)
    mean = np.zeros(n)

    # Pointwise standard deviation
    std = np.sqrt(np.diag(K))

    # Sample functions
    L = np.linalg.cholesky(K)
    samples = mean[:, np.newaxis] + L @ np.random.randn(n, n_samples)

    # Create figure
    fig, axes = plt.subplots(2, 2, figsize=(14, 10))

    # Plot 1: Mean and confidence bands
    ax = axes[0, 0]
    ax.fill_between(X.ravel(), mean - 2*std, mean + 2*std,
                    alpha=0.2, color='C0', label='±2σ (95%)')
    ax.fill_between(X.ravel(), mean - std, mean + std,
                    alpha=0.3, color='C0', label='±1σ (68%)')
    ax.plot(X, mean, 'C0-', linewidth=2, label='Mean')
    ax.set_xlabel('x')
    ax.set_ylabel('f(x)')
    ax.set_title('GP Prior: Mean and Confidence Bands')
    ax.legend()
    ax.grid(True, alpha=0.3)
    ax.set_ylim(-4, 4)

    # Plot 2: Sample functions
    ax = axes[0, 1]
    for i in range(min(10, n_samples)):
        ax.plot(X, samples[:, i], alpha=0.7, linewidth=1.5)
    ax.plot(X, mean, 'k--', linewidth=2, label='Prior mean')
    ax.set_xlabel('x')
    ax.set_ylabel('f(x)')
    ax.set_title(f'10 Sample Functions (ℓ={length_scale})')
    ax.legend()
    ax.grid(True, alpha=0.3)
    ax.set_ylim(-4, 4)

    # Plot 3: Kernel function
    ax = axes[1, 0]
    x_ref = 0.0
    kernel_vals = [sigma**2 * np.exp(-(x - x_ref)**2 / (2 * length_scale**2))
                   for x in X.ravel()]
    ax.plot(X, kernel_vals, 'C1-', linewidth=2)
    ax.axhline(y=0, color='gray', linestyle='--', alpha=0.5)
    ax.axvline(x=x_ref, color='gray', linestyle='--', alpha=0.5)
    ax.fill_between(X.ravel(), 0, kernel_vals, alpha=0.3, color='C1')
    ax.set_xlabel("x'")
    ax.set_ylabel(f"k({x_ref}, x')")
    ax.set_title(f'Covariance Function (reference x={x_ref})')
    ax.grid(True, alpha=0.3)

    # Plot 4: Correlation at different separations
    ax = axes[1, 1]
    separations = np.linspace(0, 4, 100)
    correlations = np.exp(-separations**2 / (2 * length_scale**2))
    ax.plot(separations, correlations, 'C2-', linewidth=2)
    ax.axhline(y=np.exp(-0.5), color='gray', linestyle='--', alpha=0.5)
    ax.axvline(x=length_scale, color='gray', linestyle='--', alpha=0.5)
    ax.annotate(f'Length scale = {length_scale}',
                xy=(length_scale, np.exp(-0.5)),
                xytext=(length_scale + 0.5, 0.3),
                arrowprops=dict(arrowstyle='->', color='gray'),
                fontsize=10)
    ax.set_xlabel("|x - x'|")
    ax.set_ylabel('Correlation')
    ax.set_title('Correlation Decay with Distance')
    ax.grid(True, alpha=0.3)

    plt.suptitle(f'Gaussian Process Prior (σ={sigma}, ℓ={length_scale})',
                 fontsize=14, fontweight='bold')
    plt.tight_layout()
    plt.show()

# Visualize priors with different hyperparameters
visualize_gp_prior(length_scale=0.5, sigma=1.0)  # Short length scale
visualize_gp_prior(length_scale=2.0, sigma=1.0)  # Long length scale
```

Examining prior samples is crucial for model development. If prior samples look unreasonable—too wiggly, too smooth, wrong scale—the posterior will inherit these problems. Always visualize prior samples before fitting to data.
We've explored the fundamental concept underlying Gaussian Processes: treating functions as random variables drawn from probability distributions. This perspective shift enables principled uncertainty quantification and flexible nonparametric modeling.
What's Next:
With the foundation of GP priors established, we're ready to tackle the next crucial component: prior specification. We'll examine how to choose mean functions and kernels that encode appropriate domain knowledge, exploring the rich space of kernel designs and their implications for modeling.
You now understand the foundational concept of Gaussian Processes as distributions over functions. This perspective—treating functions as random variables—is the key insight that makes GPs powerful for Bayesian nonparametric regression. Next, we'll learn how to specify GP priors through careful choice of mean and kernel functions.