Throughout your machine learning journey, you've been trained to think in terms of parameters. Linear regression finds a weight vector $\mathbf{w}$. Neural networks optimize millions of weights and biases. Support vector machines discover support vectors. In every case, the learning problem reduces to finding the right finite-dimensional vector of numbers.
Gaussian Processes ask a radical question: What if, instead of defining a prior over parameters and then deriving the induced distribution over functions, we define a prior directly over functions themselves?
This is the function space view—the conceptual foundation upon which all of GP theory rests. It represents a profound paradigm shift: rather than working in a finite-dimensional parameter space and computing what functions those parameters imply, we work directly in an infinite-dimensional function space. The functions are the fundamental objects; parameters (if they appear at all) are secondary.
By the end of this page, you will understand why viewing models as distributions over functions—rather than distributions over parameters—is both mathematically elegant and practically powerful. You'll see how this perspective provides automatic uncertainty quantification, handles complex hypothesis spaces, and avoids many of the pathologies that plague finite-dimensional models.
To appreciate the function space view, we must first understand what we're leaving behind. Consider a standard parametric regression model:
$$f(\mathbf{x}) = \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x})$$
where $\boldsymbol{\phi}(\mathbf{x})$ maps inputs to a feature space and $\mathbf{w}$ is a weight vector. The learning problem involves finding $\mathbf{w}$ that best explains the data.
What's implicit here is crucial: the choice of $\boldsymbol{\phi}$ completely determines what functions can be represented. If $\boldsymbol{\phi}$ consists of polynomial basis functions up to degree 3, you can only represent polynomials up to degree 3. If the true function is degree 4, you're systematically biased—no amount of data will fix this.
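To make this concrete, here is a minimal sketch (a toy example of my own, with an arbitrary degree-4 target): a degree-3 polynomial fit retains a systematic error no matter how much data it sees, because the true function lies outside the chosen finite-dimensional space.

```python
# Minimal sketch: data generated from a degree-4 function, fit with a
# degree-3 polynomial basis. The error floor does not shrink with more data,
# because no degree-3 polynomial can represent the target.
import numpy as np

rng = np.random.default_rng(0)
true_f = lambda x: 0.5 * x**4 - x**2          # degree-4 ground truth (toy choice)
x_test = np.linspace(-2, 2, 200)

for n in [50, 500, 5000]:
    x = rng.uniform(-2, 2, n)
    y = true_f(x) + 0.05 * rng.standard_normal(n)
    coeffs = np.polyfit(x, y, deg=3)          # best degree-3 fit to the data
    mse = np.mean((np.polyval(coeffs, x_test) - true_f(x_test))**2)
    print(f"n = {n:5d}: test MSE = {mse:.3f}")  # stays bounded away from zero
```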
These limitations aren't bugs—they're inherent to the parametric paradigm. When you parameterize a model, you're implicitly saying 'I believe the true function lives in this finite-dimensional subspace of all possible functions.' But that's already a very strong assumption. Gaussian Processes take a different path: they work in function space directly, avoiding this dimensional bottleneck entirely.
Before diving into distributions over functions, we need to understand what a function space actually is. The concept is simultaneously intuitive and profound.
Informal definition: A function space is the collection of all functions satisfying certain properties. Just as $\mathbb{R}^n$ is the space of all $n$-dimensional real vectors, a function space contains all functions of a certain type.
Key examples of function spaces: the set $C[0,1]$ of all continuous functions on $[0,1]$; the polynomials of degree at most $d$ (one of the few finite-dimensional examples); and the set of all infinitely differentiable (smooth) functions.
The critical insight: Function spaces are typically infinite-dimensional. This is not merely technical—it has profound consequences.
Consider continuous functions on $[0,1]$. Any such function can be approximated arbitrarily well by polynomials (Weierstrass approximation theorem). But you need polynomials of arbitrarily high degree, so no finite set of basis functions spans all continuous functions. The space itself is infinite-dimensional.
$$\text{dim}(C[0,1]) = \infty$$
This might seem like a mathematical curiosity, but it's exactly what we want! Real-world functions—temperature over time, stock prices, medical signals—aren't restricted to any finite-dimensional subspace. They're genuinely infinite-dimensional objects, and our modeling framework should respect this.
| Aspect | Parameter Space | Function Space |
|---|---|---|
| Dimension | Finite ($\mathbb{R}^d$) | Infinite-dimensional |
| Points in space | Weight vectors $\mathbf{w}$ | Entire functions $f(\cdot)$ |
| Prior specification | Prior on weights $p(\mathbf{w})$ | Prior on functions $p(f)$ |
| Model capacity | Fixed by design | Infinite, data-adaptive |
| Basis functions | Must choose explicitly | Implicitly infinite |
| Uncertainty source | Uncertainty about $\mathbf{w}$ | Direct uncertainty about $f$ |
Think of function space as a 'catalog of all possible explanations' for your data. Each function in the space represents one complete story about how inputs map to outputs. A distribution over function space is your uncertainty about which story is true. Rather than searching through a restricted set of parameterized stories, you maintain uncertainty across all conceivable explanations.
Now comes the central conceptual leap: defining probability distributions over infinite-dimensional function spaces.
This immediately raises a question: how can we do probability in infinite dimensions? The usual tools—probability density functions, Lebesgue measure—break down in infinite dimensions. There's no natural 'uniform distribution' over functions, and we can't just write down $p(f)$ as a density.
The brilliant insight behind Gaussian Processes: We don't need to specify the full infinite-dimensional distribution explicitly. We only need to specify what happens when we evaluate the function at finitely many points. The rest is handled by consistency requirements.
The Kolmogorov Extension Theorem: This deep result from probability theory says that a consistent family of finite-dimensional distributions uniquely determines an infinite-dimensional stochastic process. For Gaussian Processes, this means:
If we specify that for any finite set of inputs $\mathbf{x}_1, \ldots, \mathbf{x}_n$, the function values $[f(\mathbf{x}_1), \ldots, f(\mathbf{x}_n)]$ follow a multivariate Gaussian distribution...
And these specifications are consistent (meaning marginalization works correctly)...
Then there exists a unique probability distribution over the full function space!
We never explicitly construct the infinite-dimensional object. We just specify rules for finite subsets, and the mathematics guarantees a coherent infinite-dimensional distribution exists.
The Kolmogorov Extension Theorem is why Gaussian Processes work. Without it, 'distribution over functions' would be poetic language without mathematical content. With it, we have rigorous foundations: specify Gaussian distributions at all finite subsets consistently, and an infinite-dimensional Gaussian measure exists. This is not approximation—it's exact mathematical theory.
Practical implications of this construction:
Finite evaluation suffices: We never need to handle infinite-dimensional objects computationally. Every GP calculation involves only finitely many function evaluations.
Marginalization is automatic: Want to ignore some function values? Just drop them from the Gaussian—the mathematics ensures this is consistent with the full process.
Conditioning is tractable: Given observations, updating our beliefs about unobserved function values involves only Gaussian conditioning—closed-form and exact.
Coherent uncertainty: The uncertainty at unobserved locations is derived from a single consistent probabilistic model, not ad-hoc confidence intervals.
This is the power of the function space view: infinite-dimensional thinking with finite-dimensional computation.
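For reference, the finite-dimensional computation is the standard conditioning identity for a partitioned multivariate Gaussian, written here for an observed block $A$ and an unobserved block $B$:

$$\mathbf{f}_B \mid \mathbf{f}_A = \mathbf{a} \;\sim\; \mathcal{N}\!\left(\boldsymbol{\mu}_B + \mathbf{K}_{BA}\mathbf{K}_{AA}^{-1}(\mathbf{a} - \boldsymbol{\mu}_A),\;\; \mathbf{K}_{BB} - \mathbf{K}_{BA}\mathbf{K}_{AA}^{-1}\mathbf{K}_{AB}\right)$$

Every GP prediction in this page is an application of this identity to a finite set of function values.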
To build intuition, let's visualize what it means to treat functions as points in a space.
Discrete approximation: Imagine evaluating functions on a fine grid of $N$ points: $x_1, x_2, \ldots, x_N$. Any function $f$ is then (approximately) represented by the vector of its values:
$$\mathbf{f} = [f(x_1), f(x_2), \ldots, f(x_N)]^\top \in \mathbb{R}^N$$
As $N \to \infty$ and the grid becomes denser, this vector representation approaches the full function. But at any finite $N$, we have a tractable $N$-dimensional vector.
A Gaussian in function space: If we place a multivariate Gaussian distribution on $\mathbf{f}$:
$$\mathbf{f} \sim \mathcal{N}(\boldsymbol{\mu}, \mathbf{K})$$
where $\boldsymbol{\mu} = [\mu(x_1), \ldots, \mu(x_N)]^\top$ is the mean function evaluated at grid points and $\mathbf{K}_{ij} = k(x_i, x_j)$ is the covariance function evaluated pairwise, then we have a discrete approximation to a Gaussian Process.
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.linalg import cholesky

# Define a fine grid (discrete approximation to continuous domain)
N = 200
x_grid = np.linspace(0, 5, N)

# Mean function: zero mean GP prior
mu = np.zeros(N)

# Covariance function: Squared Exponential (RBF) kernel
def rbf_kernel(x1, x2, length_scale=1.0, variance=1.0):
    """Compute RBF kernel between two sets of points."""
    sqdist = np.subtract.outer(x1, x2)**2
    return variance * np.exp(-0.5 * sqdist / length_scale**2)

# Build covariance matrix
K = rbf_kernel(x_grid, x_grid, length_scale=1.0, variance=1.0)

# Add small jitter for numerical stability
K += 1e-8 * np.eye(N)

# Sample functions from the GP prior
# Each sample is a POINT in function space
np.random.seed(42)
L = cholesky(K, lower=True)
n_samples = 5

plt.figure(figsize=(12, 6))
for i in range(n_samples):
    # Sample from N(0, K) via Cholesky: f = L @ z, where z ~ N(0, I)
    z = np.random.randn(N)
    f_sample = L @ z  # This is a function drawn from the GP prior
    plt.plot(x_grid, f_sample, label=f'Function sample {i+1}', alpha=0.8)

plt.fill_between(x_grid, -2, 2, alpha=0.1, color='gray', label='±2σ credible region')
plt.xlabel('Input x', fontsize=12)
plt.ylabel('f(x)', fontsize=12)
plt.title('Samples from a Gaussian Process Prior (Function Space View)', fontsize=14)
plt.legend(loc='upper right')
plt.grid(True, alpha=0.3)
plt.xlim([0, 5])
plt.ylim([-3, 3])
plt.tight_layout()
plt.show()

print("Each curve is a 'point' in function space.")
print(f"We're visualizing a {N}-dimensional Gaussian embedded in function space.")
```

What this visualization reveals: each curve is a single, complete function drawn from the prior. The samples are smooth because the RBF kernel encodes smoothness, they fluctuate around the zero mean, and they stay mostly within the shaded ±2σ band.
After observing data, the posterior GP will concentrate probability mass around functions consistent with observations, and the spread will shrink near observed points.
A natural question arises: if GPs work in infinite-dimensional function space, how can they be computed? The answer reveals a deep connection between the function space view and more familiar parametric models.
Mercer's Theorem: For a positive semi-definite kernel $k(\mathbf{x}, \mathbf{x}')$, there exist eigenvalues $\lambda_i \geq 0$ and orthonormal eigenfunctions $\phi_i(\mathbf{x})$ such that:
$$k(\mathbf{x}, \mathbf{x}') = \sum_{i=1}^{\infty} \lambda_i \phi_i(\mathbf{x}) \phi_i(\mathbf{x}')$$
This is a spectral decomposition of the kernel. It shows that every GP can be written as:
$$f(\mathbf{x}) = \sum_{i=1}^{\infty} w_i \sqrt{\lambda_i} \phi_i(\mathbf{x}), \quad w_i \sim \mathcal{N}(0, 1)$$
where the $w_i$ are independent Gaussian weights.
This is remarkable: every GP is equivalent to a linear model with infinitely many basis functions! The eigenfunctions $\phi_i$ are the implicit basis, and the eigenvalue-weighted Gaussian prior on coefficients gives us the GP. The function space view and the weight space view (which we'll explore next page) are mathematically equivalent—just different perspectives on the same object.
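As a rough numerical illustration (a sketch only: on a finite grid, the eigendecomposition of the Gram matrix stands in for the true Mercer eigenfunctions), we can check that a truncated eigen-expansion reconstructs an RBF kernel matrix and can be used to draw approximate prior samples:

```python
# Sketch: discrete analogue of Mercer's expansion. Eigendecompose an RBF Gram
# matrix and verify that a truncated sum of eigenvalue-weighted outer products
# reconstructs it; sampling Gaussian coefficients then approximates GP draws.
import numpy as np

def rbf_kernel(x1, x2, length_scale=1.0, variance=1.0):
    return variance * np.exp(-0.5 * np.subtract.outer(x1, x2)**2 / length_scale**2)

x = np.linspace(0, 5, 100)
K = rbf_kernel(x, x)

eigvals, eigvecs = np.linalg.eigh(K)            # eigenvalues in ascending order
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]  # sort descending

for m in [5, 10, 20]:
    K_m = (eigvecs[:, :m] * eigvals[:m]) @ eigvecs[:, :m].T
    print(f"{m:2d} terms: max |K - K_m| = {np.max(np.abs(K - K_m)):.2e}")

# f = sum_i w_i * sqrt(lambda_i) * phi_i with w_i ~ N(0, 1): an approximate
# draw from the same GP prior, using only the leading m "basis functions".
m, w = 20, np.random.randn(20)
f_sample = eigvecs[:, :m] @ (np.sqrt(np.maximum(eigvals[:m], 0.0)) * w)
print("approximate prior sample, shape:", f_sample.shape)
```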
Why the function space view is still valuable:
Even though GPs are secretly infinite-dimensional linear models, the function space view offers distinct advantages:
Conceptual clarity: Thinking about smoothness, periodicity, and other function properties is more natural in function space than in coefficient space
Kernel design intuition: Kernels directly encode beliefs about functions (how smooth, how variable, how correlated across space). This is easier to reason about than infinite sets of basis functions
Computational tractability: The kernel trick lets us compute in function space without ever explicitly constructing the infinite basis—we only need kernel evaluations at finite point sets
Uncertainty interpretation: 'Uncertainty about which function' is more intuitive than 'uncertainty about infinitely many weights'
Non-parametric flexibility: We never commit to a finite number of basis functions. The effective complexity adapts to data automatically.
| Kernel | Formula | Functions Implied | Smoothness |
|---|---|---|---|
| Squared Exponential (RBF) | $\exp(-\frac{\lVert\mathbf{x}-\mathbf{x}'\rVert^2}{2\ell^2})$ | Infinitely differentiable, very smooth | $C^\infty$ |
| Matérn 3/2 | $(1+\frac{\sqrt{3}r}{\ell})\exp(-\frac{\sqrt{3}r}{\ell})$ | Once differentiable | $C^1$ |
| Matérn 5/2 | $(1+\frac{\sqrt{5}r}{\ell}+\frac{5r^2}{3\ell^2})\exp(-\frac{\sqrt{5}r}{\ell})$ | Twice differentiable | $C^2$ |
| Periodic | $\exp(-\frac{2\sin^2(\pi\lvert x-x'\rvert/p)}{\ell^2})$ | Periodic with period $p$ | $C^\infty$ |
| Linear | $\sigma_b^2 + \sigma_v^2 (x-c)(x'-c)$ | Linear functions | Continuous |
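To tie the table to actual functions, the short sketch below (toy code; the unit variance, length scale 1, and period $p = 2$ are arbitrary illustrative choices) builds three of these kernels on a 1-D grid and draws one prior sample from each. A crude finite-difference statistic gives a rough sense of how wiggly each draw is.

```python
# Sketch: one prior sample per kernel from the table above (illustrative
# hyperparameters: unit variance, length scale 1, period 2).
import numpy as np

x = np.linspace(0, 5, 300)
r = np.abs(np.subtract.outer(x, x))            # pairwise distances |x - x'|

kernels = {
    "RBF":        np.exp(-0.5 * r**2),
    "Matern 3/2": (1.0 + np.sqrt(3) * r) * np.exp(-np.sqrt(3) * r),
    "Periodic":   np.exp(-2.0 * np.sin(np.pi * r / 2.0)**2),
}

rng = np.random.default_rng(1)
for name, K in kernels.items():
    L = np.linalg.cholesky(K + 1e-6 * np.eye(len(x)))   # jitter for stability
    f = L @ rng.standard_normal(len(x))                 # one draw from GP(0, K)
    roughness = np.mean(np.diff(f)**2)                  # finite-difference proxy
    print(f"{name:10s}: roughness proxy = {roughness:.5f}")
```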
A critical property of the function space view is marginalization consistency. This property is what makes GPs coherent across different subsets of inputs.
The marginalization property: If we have a Gaussian distribution over function values at inputs $\{\mathbf{x}_1, \ldots, \mathbf{x}_n, \mathbf{x}_{n+1}\}$:

$$\begin{bmatrix} \mathbf{f}_{1:n} \\ f_{n+1} \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} \boldsymbol{\mu}_{1:n} \\ \mu_{n+1} \end{bmatrix}, \begin{bmatrix} \mathbf{K}_{1:n,\,1:n} & \mathbf{k}_{1:n,\,n+1} \\ \mathbf{k}_{n+1,\,1:n} & k_{n+1,\,n+1} \end{bmatrix} \right)$$

Then the marginal distribution over just $\mathbf{f}_{1:n}$ (ignoring $f_{n+1}$) is:

$$\mathbf{f}_{1:n} \sim \mathcal{N}(\boldsymbol{\mu}_{1:n}, \mathbf{K}_{1:n,\,1:n})$$
This is simply the Gaussian marginalization property, but its implications for function space are profound.
Many ad-hoc methods for uncertainty quantification (bootstrap, ensemble methods) don't have this marginalization property. If you train a model on inputs A and B, then later ask about just A, the answer might differ from a model trained only on A. GPs guarantee consistency: your beliefs about function values at point A are the same whether or not you've also considered point B in your model.
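The following small check (illustrative code with arbitrary input locations, reusing the RBF kernel from the earlier snippets) makes the guarantee concrete: the prior covariance over $f$ at a set of points $A$ is identical whether we build it from $A$ alone, or build the joint over $A$ and $B$ and then drop the rows and columns belonging to $B$.

```python
# Sketch: marginalization consistency of a GP prior. Dropping the rows and
# columns for B from the joint covariance over (A, B) leaves exactly the
# covariance we would have written down for A on its own.
import numpy as np

def rbf_kernel(x1, x2, length_scale=1.0):
    return np.exp(-0.5 * np.subtract.outer(x1, x2)**2 / length_scale**2)

A = np.array([0.3, 1.1, 2.4])      # points we care about
B = np.array([3.0, 4.2])           # extra points we later ignore
joint = np.concatenate([A, B])

K_joint = rbf_kernel(joint, joint)
K_marginal = K_joint[:len(A), :len(A)]      # marginalize out f(B)
K_direct = rbf_kernel(A, A)                 # prior built from A alone

print(np.allclose(K_marginal, K_direct))    # True: beliefs about A unchanged
```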
A Gaussian Process is characterized by two functions: a mean function $m(\mathbf{x})$ and a covariance function $k(\mathbf{x}, \mathbf{x}')$. Let's understand the mean function's role in the function space view.
Definition: The mean function specifies our prior expectation for the function value at any input:
$$m(\mathbf{x}) = \mathbb{E}[f(\mathbf{x})]$$
Common choices:
Zero mean: $m(\mathbf{x}) = 0$ — The most common choice. This doesn't mean we expect the function to be zero everywhere; it means we have no systematic prior belief about whether functions are positive or negative. The data will inform this.
Constant mean: $m(\mathbf{x}) = \mu_0$ — Useful when you expect the function to hover around a particular value.
Linear mean: $m(\mathbf{x}) = \mathbf{w}^\top \mathbf{x}$ — Encodes a belief that the underlying trend is linear, with the GP capturing deviations.
Parametric mean: $m(\mathbf{x}) = h(\mathbf{x}; \boldsymbol{\theta})$ — Any parametric function can serve as the mean, with the GP modeling residuals.
Function space interpretation of the mean:
The mean function can be seen as the 'center' of our prior distribution in function space. All GP samples are distributed around this central function, with the spread determined by the covariance function.
$$f(\mathbf{x}) = m(\mathbf{x}) + g(\mathbf{x}), \quad g(\mathbf{x}) \sim \mathcal{GP}(0, k)$$
Here $g$ is a zero-mean GP capturing deviations from the mean trend. This decomposition shows that the mean function only shifts where samples are centered, while the covariance function alone governs the shape and variability of the deviations; any GP can therefore be analyzed as a deterministic trend plus a zero-mean GP.
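A brief sketch of this decomposition (reusing the RBF kernel from the earlier code; the constant and linear means below are arbitrary illustrative choices): a draw from $\mathcal{GP}(m, k)$ is just $m$ evaluated on the grid plus a draw from the zero-mean process.

```python
# Sketch: samples from GP(m, k) = m(x) + samples from GP(0, k).
import numpy as np

def rbf_kernel(x1, x2, length_scale=1.0, variance=1.0):
    return variance * np.exp(-0.5 * np.subtract.outer(x1, x2)**2 / length_scale**2)

x = np.linspace(0, 5, 200)
L = np.linalg.cholesky(rbf_kernel(x, x) + 1e-8 * np.eye(len(x)))

mean_functions = {
    "zero":     np.zeros_like(x),
    "constant": np.full_like(x, 2.0),    # illustrative mu_0 = 2
    "linear":   0.5 * x - 1.0,           # illustrative linear trend
}

rng = np.random.default_rng(0)
for name, m in mean_functions.items():
    g = L @ rng.standard_normal(len(x))  # zero-mean GP draw
    f = m + g                            # shift by the chosen mean function
    print(f"{name:8s} mean: sample average of f(x) = {f.mean():+.2f}")
```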
In practice, zero-mean GPs are most common because: (1) centering data achieves the same effect, (2) the posterior mean adapts to data anyway, and (3) fewer hyperparameters means simpler optimization. Use non-zero mean functions when you have strong prior knowledge about trends (e.g., physics-based models) or when extrapolation behavior matters significantly.
Perhaps the most compelling reason to adopt the function space view is the principled uncertainty quantification it provides.
Uncertainty in parametric models: In standard Bayesian regression, uncertainty arises from not knowing the true parameters. We have a posterior $p(\mathbf{w}|\mathcal{D})$, and we integrate over this uncertainty when making predictions:
$$p(f_* \mid \mathbf{x}_*, \mathcal{D}) = \int p(f_* \mid \mathbf{x}_*, \mathbf{w})\, p(\mathbf{w} \mid \mathcal{D})\, d\mathbf{w}$$
But this uncertainty is about parameters, and the induced uncertainty about functions depends heavily on the chosen basis.
Uncertainty in GPs: In contrast, GP uncertainty is directly about functions. We have a posterior distribution over function space:
$$p(f|\mathcal{D}) \quad \text{(distribution over entire functions)}$$
And we can extract predictions with uncertainty at any point:
$$f_* \mid \mathbf{x}_*, \mathcal{D} \sim \mathcal{N}(\mu_*, \sigma_*^2)$$

where $\mu_*$ and $\sigma_*^2$ come from Gaussian conditioning.
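For completeness, with training covariance matrix $\mathbf{K}$ (where $\mathbf{K}_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)$), noisy targets $\mathbf{y}$, noise variance $\sigma_n^2$, and a zero prior mean, Gaussian conditioning yields the standard predictive equations, which are exactly what the code later on this page computes:

$$\mu_* = \mathbf{k}_*^\top (\mathbf{K} + \sigma_n^2 \mathbf{I})^{-1} \mathbf{y}, \qquad \sigma_*^2 = k(\mathbf{x}_*, \mathbf{x}_*) - \mathbf{k}_*^\top (\mathbf{K} + \sigma_n^2 \mathbf{I})^{-1} \mathbf{k}_*$$

Here $\mathbf{k}_*$ collects the covariances $k(\mathbf{x}_*, \mathbf{x}_i)$ between the test input and the training inputs.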
GPs make predictions with integrity: when they don't know, they say so. This is invaluable in decision-making contexts—Bayesian optimization, active learning, safety-critical systems—where understanding what the model doesn't know is as important as its point predictions.
The function space view provides a beautiful geometric interpretation of learning. Before seeing data, we have a prior distribution over function space—a probability cloud encompassing many possible functions. After observing data, we condition on the observations, concentrating probability mass on functions consistent with what we've seen.
The prior $p(f) = \mathcal{GP}(m, k)$: A diffuse cloud over function space, centered on the mean function $m(\mathbf{x})$ and shaped by the covariance function $k(\mathbf{x}, \mathbf{x}')$, which controls how smooth and how variable the candidate functions are.
The likelihood $p(\mathcal{D}|f)$: A constraint that says 'functions passing near these observations are more likely.' For regression with Gaussian noise $\epsilon \sim \mathcal{N}(0, \sigma_n^2)$:
$$p(y_i | f, \mathbf{x}_i) = \mathcal{N}(y_i | f(\mathbf{x}_i), \sigma_n^2)$$
The posterior $p(f|\mathcal{D}) \propto p(\mathcal{D}|f) p(f)$: The prior cloud sliced by the likelihood constraints. Remarkably, for GPs with Gaussian likelihood:
$$p(f|\mathcal{D}) = \mathcal{GP}(m', k')$$
The posterior is also a Gaussian Process, with analytically updated mean and covariance functions!
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.linalg import solve_triangular, cholesky

# Setup
np.random.seed(42)
x_grid = np.linspace(0, 5, 200)

def rbf_kernel(x1, x2, length_scale=1.0, variance=1.0):
    return variance * np.exp(-0.5 * np.subtract.outer(x1, x2)**2 / length_scale**2)

# Observed data (sparse, noisy observations)
X_train = np.array([0.5, 1.5, 2.5, 4.0])
y_train = np.array([0.8, -0.5, 0.3, -0.2])
noise_var = 0.1

# Compute GP posterior
K_train = rbf_kernel(X_train, X_train) + noise_var * np.eye(len(X_train))
K_star = rbf_kernel(x_grid, X_train)
K_star_star = rbf_kernel(x_grid, x_grid)

# Cholesky solve for efficiency
L = cholesky(K_train, lower=True)
alpha = solve_triangular(L.T, solve_triangular(L, y_train, lower=True))
v = solve_triangular(L, K_star.T, lower=True)

# Posterior mean and covariance
mu_post = K_star @ alpha
K_post = K_star_star - v.T @ v

# Sample from prior and posterior
L_prior = cholesky(K_star_star + 1e-8 * np.eye(len(x_grid)), lower=True)
L_post = cholesky(K_post + 1e-8 * np.eye(len(x_grid)), lower=True)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Left: Prior samples
ax = axes[0]
for i in range(5):
    sample = L_prior @ np.random.randn(len(x_grid))
    ax.plot(x_grid, sample, alpha=0.7, linewidth=1.5)
ax.axhline(y=0, color='black', linestyle='--', alpha=0.3, label='Prior mean')
ax.fill_between(x_grid, -2, 2, alpha=0.1, color='blue', label='±2σ region')
ax.set_xlabel('x', fontsize=12)
ax.set_ylabel('f(x)', fontsize=12)
ax.set_title('Prior: GP Before Observing Data', fontsize=14)
ax.set_ylim([-3, 3])
ax.legend(loc='upper right')
ax.grid(True, alpha=0.3)

# Right: Posterior samples
ax = axes[1]
for i in range(5):
    sample = mu_post + L_post @ np.random.randn(len(x_grid))
    ax.plot(x_grid, sample, alpha=0.7, linewidth=1.5)
ax.plot(x_grid, mu_post, 'k-', linewidth=2, label='Posterior mean')
posterior_std = np.sqrt(np.diag(K_post))
ax.fill_between(x_grid, mu_post - 2*posterior_std, mu_post + 2*posterior_std,
                alpha=0.2, color='blue', label='±2σ region')
ax.scatter(X_train, y_train, c='red', s=100, zorder=5,
           edgecolors='black', label='Observations')
ax.set_xlabel('x', fontsize=12)
ax.set_ylabel('f(x)', fontsize=12)
ax.set_title('Posterior: GP After Observing Data', fontsize=14)
ax.set_ylim([-3, 3])
ax.legend(loc='upper right')
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("Prior: diffuse uncertainty everywhere")
print("Posterior: uncertainty collapses near observations, remains high elsewhere")
```

Key observations from the visualization: prior samples wander freely and the uncertainty band is equally wide everywhere, while posterior samples are pinned near the observations, where the band collapses almost to the noise level; away from the data, the band widens back toward the prior.
We've now established the function space view—the conceptual foundation for understanding Gaussian Processes. Let's consolidate the key insights:

GPs place a prior directly over functions; parameters, if they appear at all, are secondary.

The Kolmogorov Extension Theorem makes 'a distribution over functions' rigorous: consistent finite-dimensional Gaussians determine the whole process.

All computation is finite-dimensional Gaussian algebra: marginalization and conditioning at the observed and queried points.

The kernel encodes beliefs about function properties (smoothness, periodicity, variability); the mean function sets the central trend.

Marginalization consistency and closed-form conditioning give coherent, principled uncertainty about the function itself.
The function space view is one side of a fundamental duality in GP theory. In the next page, we'll explore the weight space view, which shows how GPs can equivalently be understood as Bayesian linear regression with an infinite number of basis functions. Together, these perspectives provide complete intuition for how and why GPs work.
You now understand the function space perspective on Gaussian Processes—viewing models as distributions over functions rather than parameters. This conceptual shift is foundational: it explains GP uncertainty quantification, marginalization consistency, and the non-parametric flexibility that makes GPs uniquely powerful. Proceed to understand the complementary weight space view.