Throughout your machine learning journey, you've been trained to think in terms of parameters. Linear regression finds a weight vector $\mathbf{w}$. Neural networks optimize millions of weights and biases. Support vector machines discover support vectors. In every case, the learning problem reduces to finding the right finite-dimensional vector of numbers.
Gaussian Processes ask a radical question: What if, instead of defining a prior over parameters and then deriving the induced distribution over functions, we define a prior directly over functions themselves?
This is the function space view—the conceptual foundation upon which all of GP theory rests. It represents a profound paradigm shift: rather than working in a finite-dimensional parameter space and computing what functions those parameters imply, we work directly in an infinite-dimensional function space. The functions are the fundamental objects; parameters (if they appear at all) are secondary.
By the end of this page, you will understand why viewing models as distributions over functions—rather than distributions over parameters—is both mathematically elegant and practically powerful. You'll see how this perspective provides automatic uncertainty quantification, handles complex hypothesis spaces, and avoids many of the pathologies that plague finite-dimensional models.
To appreciate the function space view, we must first understand what we're leaving behind. Consider a standard parametric regression model:
$$f(\mathbf{x}) = \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x})$$
where $\boldsymbol{\phi}(\mathbf{x})$ maps inputs to a feature space and $\mathbf{w}$ is a weight vector. The learning problem involves finding $\mathbf{w}$ that best explains the data.
What's implicit here is crucial: the choice of $\boldsymbol{\phi}$ completely determines what functions can be represented. If $\boldsymbol{\phi}$ consists of polynomial basis functions up to degree 3, you can only represent polynomials up to degree 3. If the true function is degree 4, you're systematically biased—no amount of data will fix this.
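To make this concrete, here is a minimal sketch (a toy example of my own, with an arbitrary degree-4 target): a degree-3 polynomial fit retains a systematic error no matter how much data it sees, because the true function lies outside the chosen finite-dimensional space.

```python
# Minimal sketch: data generated from a degree-4 function, fit with a
# degree-3 polynomial basis. The error floor does not shrink with more data,
# because no degree-3 polynomial can represent the target.
import numpy as np

rng = np.random.default_rng(0)
true_f = lambda x: 0.5 * x**4 - x**2          # degree-4 ground truth (toy choice)
x_test = np.linspace(-2, 2, 200)

for n in [50, 500, 5000]:
    x = rng.uniform(-2, 2, n)
    y = true_f(x) + 0.05 * rng.standard_normal(n)
    coeffs = np.polyfit(x, y, deg=3)          # best degree-3 fit to the data
    mse = np.mean((np.polyval(coeffs, x_test) - true_f(x_test))**2)
    print(f"n = {n:5d}: test MSE = {mse:.3f}")  # stays bounded away from zero
```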
These limitations aren't bugs—they're inherent to the parametric paradigm. When you parameterize a model, you're implicitly saying 'I believe the true function lives in this finite-dimensional subspace of all possible functions.' But that's already a very strong assumption. Gaussian Processes take a different path: they work in function space directly, avoiding this dimensional bottleneck entirely.
Before diving into distributions over functions, we need to understand what a function space actually is. The concept is simultaneously intuitive and profound.
Informal definition: A function space is the collection of all functions satisfying certain properties. Just as $\mathbb{R}^n$ is the space of all $n$-dimensional real vectors, a function space contains all functions of a certain type.
Key examples of function spaces: the set $C[0,1]$ of all continuous functions on $[0,1]$; the polynomials of degree at most $d$ (one of the few finite-dimensional examples); and the set of all infinitely differentiable (smooth) functions.
The critical insight: Function spaces are typically infinite-dimensional. This is not merely technical—it has profound consequences.
Consider continuous functions on $[0,1]$. Any such function can be approximated arbitrarily well by polynomials (Weierstrass approximation theorem). But you need polynomials of arbitrarily high degree, so no finite set of basis functions spans all continuous functions. The space itself is infinite-dimensional.
$$\text{dim}(C[0,1]) = \infty$$
This might seem like a mathematical curiosity, but it's exactly what we want! Real-world functions—temperature over time, stock prices, medical signals—aren't restricted to any finite-dimensional subspace. They're genuinely infinite-dimensional objects, and our modeling framework should respect this.
| Aspect | Parameter Space | Function Space |
|---|---|---|
| Dimension | Finite ($\mathbb{R}^d$) | Infinite-dimensional |
| Points in space | Weight vectors $\mathbf{w}$ | Entire functions $f(\cdot)$ |
| Prior specification | Prior on weights $p(\mathbf{w})$ | Prior on functions $p(f)$ |
| Model capacity | Fixed by design | Infinite, data-adaptive |
| Basis functions | Must choose explicitly | Implicitly infinite |
| Uncertainty source | Uncertainty about $\mathbf{w}$ | Direct uncertainty about $f$ |
Think of function space as a 'catalog of all possible explanations' for your data. Each function in the space represents one complete story about how inputs map to outputs. A distribution over function space is your uncertainty about which story is true. Rather than searching through a restricted set of parameterized stories, you maintain uncertainty across all conceivable explanations.
Now comes the central conceptual leap: defining probability distributions over infinite-dimensional function spaces.
This immediately raises a question: how can we do probability in infinite dimensions? The usual tools—probability density functions, Lebesgue measure—break down in infinite dimensions. There's no natural 'uniform distribution' over functions, and we can't just write down $p(f)$ as a density.
The brilliant insight behind Gaussian Processes: We don't need to specify the full infinite-dimensional distribution explicitly. We only need to specify what happens when we evaluate the function at finitely many points. The rest is handled by consistency requirements.
The Kolmogorov Extension Theorem: This deep result from probability theory says that a consistent family of finite-dimensional distributions uniquely determines an infinite-dimensional stochastic process. For Gaussian Processes, this means:
If we specify that for any finite set of inputs $\mathbf{x}_1, \ldots, \mathbf{x}_n$, the function values $[f(\mathbf{x}_1), \ldots, f(\mathbf{x}_n)]$ follow a multivariate Gaussian distribution...
And these specifications are consistent (meaning marginalization works correctly)...
Then there exists a unique probability distribution over the full function space!
We never explicitly construct the infinite-dimensional object. We just specify rules for finite subsets, and the mathematics guarantees a coherent infinite-dimensional distribution exists.
The Kolmogorov Extension Theorem is why Gaussian Processes work. Without it, 'distribution over functions' would be poetic language without mathematical content. With it, we have rigorous foundations: specify Gaussian distributions at all finite subsets consistently, and an infinite-dimensional Gaussian measure exists. This is not approximation—it's exact mathematical theory.
Practical implications of this construction:
Finite evaluation suffices: We never need to handle infinite-dimensional objects computationally. Every GP calculation involves only finitely many function evaluations.
Marginalization is automatic: Want to ignore some function values? Just drop them from the Gaussian—the mathematics ensures this is consistent with the full process.
Conditioning is tractable: Given observations, updating our beliefs about unobserved function values involves only Gaussian conditioning—closed-form and exact.
Coherent uncertainty: The uncertainty at unobserved locations is derived from a single consistent probabilistic model, not ad-hoc confidence intervals.
This is the power of the function space view: infinite-dimensional thinking with finite-dimensional computation.
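For reference, the finite-dimensional computation is the standard conditioning identity for a partitioned multivariate Gaussian, written here for an observed block $A$ and an unobserved block $B$:

$$\mathbf{f}_B \mid \mathbf{f}_A = \mathbf{a} \;\sim\; \mathcal{N}\!\left(\boldsymbol{\mu}_B + \mathbf{K}_{BA}\mathbf{K}_{AA}^{-1}(\mathbf{a} - \boldsymbol{\mu}_A),\;\; \mathbf{K}_{BB} - \mathbf{K}_{BA}\mathbf{K}_{AA}^{-1}\mathbf{K}_{AB}\right)$$

Every GP prediction in this page is an application of this identity to a finite set of function values.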
To build intuition, let's visualize what it means to treat functions as points in a space.
Discrete approximation: Imagine evaluating functions on a fine grid of $N$ points: $x_1, x_2, \ldots, x_N$. Any function $f$ is then (approximately) represented by the vector of its values:
$$\mathbf{f} = [f(x_1), f(x_2), \ldots, f(x_N)]^\top \in \mathbb{R}^N$$
As $N \to \infty$ and the grid becomes denser, this vector representation approaches the full function. But at any finite $N$, we have a tractable $N$-dimensional vector.
A Gaussian in function space: If we place a multivariate Gaussian distribution on $\mathbf{f}$:
$$\mathbf{f} \sim \mathcal{N}(\boldsymbol{\mu}, \mathbf{K})$$
where $\boldsymbol{\mu} = [\mu(x_1), \ldots, \mu(x_N)]^\top$ is the mean function evaluated at grid points and $\mathbf{K}_{ij} = k(x_i, x_j)$ is the covariance function evaluated pairwise, then we have a discrete approximation to a Gaussian Process.
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.linalg import cholesky

# Define a fine grid (discrete approximation to continuous domain)
N = 200
x_grid = np.linspace(0, 5, N)

# Mean function: zero mean GP prior
mu = np.zeros(N)

# Covariance function: Squared Exponential (RBF) kernel
def rbf_kernel(x1, x2, length_scale=1.0, variance=1.0):
    """Compute RBF kernel between two sets of points."""
    sqdist = np.subtract.outer(x1, x2)**2
    return variance * np.exp(-0.5 * sqdist / length_scale**2)

# Build covariance matrix
K = rbf_kernel(x_grid, x_grid, length_scale=1.0, variance=1.0)

# Add small jitter for numerical stability
K += 1e-8 * np.eye(N)

# Sample functions from the GP prior
# Each sample is a POINT in function space
np.random.seed(42)
L = cholesky(K, lower=True)
n_samples = 5

plt.figure(figsize=(12, 6))
for i in range(n_samples):
    # Sample from N(0, K) via Cholesky: f = L @ z, where z ~ N(0, I)
    z = np.random.randn(N)
    f_sample = L @ z  # This is a function drawn from the GP prior
    plt.plot(x_grid, f_sample, label=f'Function sample {i+1}', alpha=0.8)

plt.fill_between(x_grid, -2, 2, alpha=0.1, color='gray', label='±2σ credible region')
plt.xlabel('Input x', fontsize=12)
plt.ylabel('f(x)', fontsize=12)
plt.title('Samples from a Gaussian Process Prior (Function Space View)', fontsize=14)
plt.legend(loc='upper right')
plt.grid(True, alpha=0.3)
plt.xlim([0, 5])
plt.ylim([-3, 3])
plt.tight_layout()
plt.show()

print("Each curve is a 'point' in function space.")
print(f"We're visualizing a {N}-dimensional Gaussian embedded in function space.")
```

What this visualization reveals: each curve is a single, complete function drawn from the prior. The samples are smooth because the RBF kernel encodes smoothness, they fluctuate around the zero mean, and they stay mostly within the shaded ±2σ band.
After observing data, the posterior GP will concentrate probability mass around functions consistent with observations, and the spread will shrink near observed points.
A natural question arises: if GPs work in infinite-dimensional function space, how can they be computed? The answer reveals a deep connection between the function space view and more familiar parametric models.
Mercer's Theorem: For a positive semi-definite kernel $k(\mathbf{x}, \mathbf{x}')$, there exist eigenvalues $\lambda_i \geq 0$ and orthonormal eigenfunctions $\phi_i(\mathbf{x})$ such that:
$$k(\mathbf{x}, \mathbf{x}') = \sum_{i=1}^{\infty} \lambda_i \phi_i(\mathbf{x}) \phi_i(\mathbf{x}')$$
This is a spectral decomposition of the kernel. It shows that every GP can be written as:
$$f(\mathbf{x}) = \sum_{i=1}^{\infty} w_i \sqrt{\lambda_i} \phi_i(\mathbf{x}), \quad w_i \sim \mathcal{N}(0, 1)$$
where the $w_i$ are independent Gaussian weights.
This is remarkable: every GP is equivalent to a linear model with infinitely many basis functions! The eigenfunctions $\phi_i$ are the implicit basis, and the eigenvalue-weighted Gaussian prior on coefficients gives us the GP. The function space view and the weight space view (which we'll explore next page) are mathematically equivalent—just different perspectives on the same object.
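As a rough numerical illustration (a sketch only: on a finite grid, the eigendecomposition of the Gram matrix stands in for the true Mercer eigenfunctions), we can check that a truncated eigen-expansion reconstructs an RBF kernel matrix and can be used to draw approximate prior samples:

```python
# Sketch: discrete analogue of Mercer's expansion. Eigendecompose an RBF Gram
# matrix and verify that a truncated sum of eigenvalue-weighted outer products
# reconstructs it; sampling Gaussian coefficients then approximates GP draws.
import numpy as np

def rbf_kernel(x1, x2, length_scale=1.0, variance=1.0):
    return variance * np.exp(-0.5 * np.subtract.outer(x1, x2)**2 / length_scale**2)

x = np.linspace(0, 5, 100)
K = rbf_kernel(x, x)

eigvals, eigvecs = np.linalg.eigh(K)            # eigenvalues in ascending order
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]  # sort descending

for m in [5, 10, 20]:
    K_m = (eigvecs[:, :m] * eigvals[:m]) @ eigvecs[:, :m].T
    print(f"{m:2d} terms: max |K - K_m| = {np.max(np.abs(K - K_m)):.2e}")

# f = sum_i w_i * sqrt(lambda_i) * phi_i with w_i ~ N(0, 1): an approximate
# draw from the same GP prior, using only the leading m "basis functions".
m, w = 20, np.random.randn(20)
f_sample = eigvecs[:, :m] @ (np.sqrt(np.maximum(eigvals[:m], 0.0)) * w)
print("approximate prior sample, shape:", f_sample.shape)
```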
Why the function space view is still valuable:
Even though GPs are secretly infinite-dimensional linear models, the function space view offers distinct advantages:
Conceptual clarity: Thinking about smoothness, periodicity, and other function properties is more natural in function space than in coefficient space
Kernel design intuition: Kernels directly encode beliefs about functions (how smooth, how variable, how correlated across space). This is easier to reason about than infinite sets of basis functions
Computational tractability: The kernel trick lets us compute in function space without ever explicitly constructing the infinite basis—we only need kernel evaluations at finite point sets
Uncertainty interpretation: 'Uncertainty about which function' is more intuitive than 'uncertainty about infinitely many weights'
Non-parametric flexibility: We never commit to a finite number of basis functions. The effective complexity adapts to data automatically.
| Kernel | Formula | Functions Implied | Smoothness |
|---|---|---|---|
| Squared Exponential (RBF) | $\exp(-\frac{\lVert\mathbf{x}-\mathbf{x}'\rVert^2}{2\ell^2})$ | Infinitely differentiable, very smooth | $C^\infty$ |
| Matérn 3/2 | $(1+\frac{\sqrt{3}r}{\ell})\exp(-\frac{\sqrt{3}r}{\ell})$ | Once differentiable | $C^1$ |
| Matérn 5/2 | $(1+\frac{\sqrt{5}r}{\ell}+\frac{5r^2}{3\ell^2})\exp(-\frac{\sqrt{5}r}{\ell})$ | Twice differentiable | $C^2$ |
| Periodic | $\exp(-\frac{2\sin^2(\pi\lvert x-x'\rvert/p)}{\ell^2})$ | Periodic with period $p$ | $C^\infty$ |
| Linear | $\sigma_b^2 + \sigma_v^2 (x-c)(x'-c)$ | Linear functions | Continuous |
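To tie the table to actual functions, the short sketch below (toy code; the unit variance, length scale 1, and period $p = 2$ are arbitrary illustrative choices) builds three of these kernels on a 1-D grid and draws one prior sample from each. A crude finite-difference statistic gives a rough sense of how wiggly each draw is.

```python
# Sketch: one prior sample per kernel from the table above (illustrative
# hyperparameters: unit variance, length scale 1, period 2).
import numpy as np

x = np.linspace(0, 5, 300)
r = np.abs(np.subtract.outer(x, x))            # pairwise distances |x - x'|

kernels = {
    "RBF":        np.exp(-0.5 * r**2),
    "Matern 3/2": (1.0 + np.sqrt(3) * r) * np.exp(-np.sqrt(3) * r),
    "Periodic":   np.exp(-2.0 * np.sin(np.pi * r / 2.0)**2),
}

rng = np.random.default_rng(1)
for name, K in kernels.items():
    L = np.linalg.cholesky(K + 1e-6 * np.eye(len(x)))   # jitter for stability
    f = L @ rng.standard_normal(len(x))                 # one draw from GP(0, K)
    roughness = np.mean(np.diff(f)**2)                  # finite-difference proxy
    print(f"{name:10s}: roughness proxy = {roughness:.5f}")
```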
A critical property of the function space view is marginalization consistency. This property is what makes GPs coherent across different subsets of inputs.
The marginalization property: If we have a Gaussian distribution over function values at inputs $\{\mathbf{x}_1, \ldots, \mathbf{x}_n, \mathbf{x}_{n+1}\}$:

$$\begin{bmatrix} \mathbf{f}_{1:n} \\ f_{n+1} \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} \boldsymbol{\mu}_{1:n} \\ \mu_{n+1} \end{bmatrix}, \begin{bmatrix} \mathbf{K}_{1:n,\,1:n} & \mathbf{k}_{1:n,\,n+1} \\ \mathbf{k}_{n+1,\,1:n} & k_{n+1,\,n+1} \end{bmatrix} \right)$$

Then the marginal distribution over just $\mathbf{f}_{1:n}$ (ignoring $f_{n+1}$) is:

$$\mathbf{f}_{1:n} \sim \mathcal{N}(\boldsymbol{\mu}_{1:n}, \mathbf{K}_{1:n,\,1:n})$$
This is simply the Gaussian marginalization property, but its implications for function space are profound.
Many ad-hoc methods for uncertainty quantification (bootstrap, ensemble methods) don't have this marginalization property. If you train a model on inputs A and B, then later ask about just A, the answer might differ from a model trained only on A. GPs guarantee consistency: your beliefs about function values at point A are the same whether or not you've also considered point B in your model.
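The following small check (illustrative code with arbitrary input locations, reusing the RBF kernel from the earlier snippets) makes the guarantee concrete: the prior covariance over $f$ at a set of points $A$ is identical whether we build it from $A$ alone, or build the joint over $A$ and $B$ and then drop the rows and columns belonging to $B$.

```python
# Sketch: marginalization consistency of a GP prior. Dropping the rows and
# columns for B from the joint covariance over (A, B) leaves exactly the
# covariance we would have written down for A on its own.
import numpy as np

def rbf_kernel(x1, x2, length_scale=1.0):
    return np.exp(-0.5 * np.subtract.outer(x1, x2)**2 / length_scale**2)

A = np.array([0.3, 1.1, 2.4])      # points we care about
B = np.array([3.0, 4.2])           # extra points we later ignore
joint = np.concatenate([A, B])

K_joint = rbf_kernel(joint, joint)
K_marginal = K_joint[:len(A), :len(A)]      # marginalize out f(B)
K_direct = rbf_kernel(A, A)                 # prior built from A alone

print(np.allclose(K_marginal, K_direct))    # True: beliefs about A unchanged
```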
A Gaussian Process is characterized by two functions: a mean function $m(\mathbf{x})$ and a covariance function $k(\mathbf{x}, \mathbf{x}')$. Let's understand the mean function's role in the function space view.
Definition: The mean function specifies our prior expectation for the function value at any input:
$$m(\mathbf{x}) = \mathbb{E}[f(\mathbf{x})]$$
Common choices:
Zero mean: $m(\mathbf{x}) = 0$ — The most common choice. This doesn't mean we expect the function to be zero everywhere; it means we have no systematic prior belief about whether functions are positive or negative. The data will inform this.
Constant mean: $m(\mathbf{x}) = \mu_0$ — Useful when you expect the function to hover around a particular value.
Linear mean: $m(\mathbf{x}) = \mathbf{w}^\top \mathbf{x}$ — Encodes a belief that the underlying trend is linear, with the GP capturing deviations.
Parametric mean: $m(\mathbf{x}) = h(\mathbf{x}; \boldsymbol{\theta})$ — Any parametric function can serve as the mean, with the GP modeling residuals.
Function space interpretation of the mean:
The mean function can be seen as the 'center' of our prior distribution in function space. All GP samples are distributed around this central function, with the spread determined by the covariance function.
$$f(\mathbf{x}) = m(\mathbf{x}) + g(\mathbf{x}), \quad g(\mathbf{x}) \sim \mathcal{GP}(0, k)$$
Here $g$ is a zero-mean GP capturing deviations from the mean trend. This decomposition shows that the mean function only shifts where samples are centered, while the covariance function alone governs the shape and variability of the deviations; any GP can therefore be analyzed as a deterministic trend plus a zero-mean GP.
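A brief sketch of this decomposition (reusing the RBF kernel from the earlier code; the constant and linear means below are arbitrary illustrative choices): a draw from $\mathcal{GP}(m, k)$ is just $m$ evaluated on the grid plus a draw from the zero-mean process.

```python
# Sketch: samples from GP(m, k) = m(x) + samples from GP(0, k).
import numpy as np

def rbf_kernel(x1, x2, length_scale=1.0, variance=1.0):
    return variance * np.exp(-0.5 * np.subtract.outer(x1, x2)**2 / length_scale**2)

x = np.linspace(0, 5, 200)
L = np.linalg.cholesky(rbf_kernel(x, x) + 1e-8 * np.eye(len(x)))

mean_functions = {
    "zero":     np.zeros_like(x),
    "constant": np.full_like(x, 2.0),    # illustrative mu_0 = 2
    "linear":   0.5 * x - 1.0,           # illustrative linear trend
}

rng = np.random.default_rng(0)
for name, m in mean_functions.items():
    g = L @ rng.standard_normal(len(x))  # zero-mean GP draw
    f = m + g                            # shift by the chosen mean function
    print(f"{name:8s} mean: sample average of f(x) = {f.mean():+.2f}")
```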
In practice, zero-mean GPs are most common because: (1) centering data achieves the same effect, (2) the posterior mean adapts to data anyway, and (3) fewer hyperparameters means simpler optimization. Use non-zero mean functions when you have strong prior knowledge about trends (e.g., physics-based models) or when extrapolation behavior matters significantly.
Perhaps the most compelling reason to adopt the function space view is the principled uncertainty quantification it provides.
Uncertainty in parametric models: In standard Bayesian regression, uncertainty arises from not knowing the true parameters. We have a posterior $p(\mathbf{w}|\mathcal{D})$, and we integrate over this uncertainty when making predictions:
$$p(f_* \mid \mathbf{x}_*, \mathcal{D}) = \int p(f_* \mid \mathbf{x}_*, \mathbf{w})\, p(\mathbf{w} \mid \mathcal{D})\, d\mathbf{w}$$
But this uncertainty is about parameters, and the induced uncertainty about functions depends heavily on the chosen basis.
Uncertainty in GPs: In contrast, GP uncertainty is directly about functions. We have a posterior distribution over function space:
$$p(f|\mathcal{D}) \quad \text{(distribution over entire functions)}$$
And we can extract predictions with uncertainty at any point:
$$f_* \mid \mathbf{x}_*, \mathcal{D} \sim \mathcal{N}(\mu_*, \sigma_*^2)$$

where $\mu_*$ and $\sigma_*^2$ come from Gaussian conditioning.
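For completeness, with training covariance matrix $\mathbf{K}$ (where $\mathbf{K}_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)$), noisy targets $\mathbf{y}$, noise variance $\sigma_n^2$, and a zero prior mean, Gaussian conditioning yields the standard predictive equations, which are exactly what the code later on this page computes:

$$\mu_* = \mathbf{k}_*^\top (\mathbf{K} + \sigma_n^2 \mathbf{I})^{-1} \mathbf{y}, \qquad \sigma_*^2 = k(\mathbf{x}_*, \mathbf{x}_*) - \mathbf{k}_*^\top (\mathbf{K} + \sigma_n^2 \mathbf{I})^{-1} \mathbf{k}_*$$

Here $\mathbf{k}_*$ collects the covariances $k(\mathbf{x}_*, \mathbf{x}_i)$ between the test input and the training inputs.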
GPs make predictions with integrity: when they don't know, they say so. This is invaluable in decision-making contexts—Bayesian optimization, active learning, safety-critical systems—where understanding what the model doesn't know is as important as its point predictions.
The function space view provides a beautiful geometric interpretation of learning. Before seeing data, we have a prior distribution over function space—a probability cloud encompassing many possible functions. After observing data, we condition on the observations, concentrating probability mass on functions consistent with what we've seen.
The prior $p(f) = \mathcal{GP}(m, k)$: A diffuse cloud over function space, centered on the mean function $m(\mathbf{x})$ and shaped by the covariance function $k(\mathbf{x}, \mathbf{x}')$, which controls how smooth and how variable the candidate functions are.
The likelihood $p(\mathcal{D}|f)$: A constraint that says 'functions passing near these observations are more likely.' For regression with Gaussian noise $\epsilon \sim \mathcal{N}(0, \sigma_n^2)$:
$$p(y_i | f, \mathbf{x}_i) = \mathcal{N}(y_i | f(\mathbf{x}_i), \sigma_n^2)$$
The posterior $p(f|\mathcal{D}) \propto p(\mathcal{D}|f) p(f)$: The prior cloud sliced by the likelihood constraints. Remarkably, for GPs with Gaussian likelihood:
$$p(f|\mathcal{D}) = \mathcal{GP}(m', k')$$
The posterior is also a Gaussian Process, with analytically updated mean and covariance functions!
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.linalg import solve_triangular, cholesky

# Setup
np.random.seed(42)
x_grid = np.linspace(0, 5, 200)

def rbf_kernel(x1, x2, length_scale=1.0, variance=1.0):
    return variance * np.exp(-0.5 * np.subtract.outer(x1, x2)**2 / length_scale**2)

# Observed data (sparse, noisy observations)
X_train = np.array([0.5, 1.5, 2.5, 4.0])
y_train = np.array([0.8, -0.5, 0.3, -0.2])
noise_var = 0.1

# Compute GP posterior
K_train = rbf_kernel(X_train, X_train) + noise_var * np.eye(len(X_train))
K_star = rbf_kernel(x_grid, X_train)
K_star_star = rbf_kernel(x_grid, x_grid)

# Cholesky solve for efficiency
L = cholesky(K_train, lower=True)
alpha = solve_triangular(L.T, solve_triangular(L, y_train, lower=True))
v = solve_triangular(L, K_star.T, lower=True)

# Posterior mean and covariance
mu_post = K_star @ alpha
K_post = K_star_star - v.T @ v

# Sample from prior and posterior
L_prior = cholesky(K_star_star + 1e-8 * np.eye(len(x_grid)), lower=True)
L_post = cholesky(K_post + 1e-8 * np.eye(len(x_grid)), lower=True)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Left: Prior samples
ax = axes[0]
for i in range(5):
    sample = L_prior @ np.random.randn(len(x_grid))
    ax.plot(x_grid, sample, alpha=0.7, linewidth=1.5)
ax.axhline(y=0, color='black', linestyle='--', alpha=0.3, label='Prior mean')
ax.fill_between(x_grid, -2, 2, alpha=0.1, color='blue', label='±2σ region')
ax.set_xlabel('x', fontsize=12)
ax.set_ylabel('f(x)', fontsize=12)
ax.set_title('Prior: GP Before Observing Data', fontsize=14)
ax.set_ylim([-3, 3])
ax.legend(loc='upper right')
ax.grid(True, alpha=0.3)

# Right: Posterior samples
ax = axes[1]
for i in range(5):
    sample = mu_post + L_post @ np.random.randn(len(x_grid))
    ax.plot(x_grid, sample, alpha=0.7, linewidth=1.5)
ax.plot(x_grid, mu_post, 'k-', linewidth=2, label='Posterior mean')
posterior_std = np.sqrt(np.diag(K_post))
ax.fill_between(x_grid, mu_post - 2*posterior_std, mu_post + 2*posterior_std,
                alpha=0.2, color='blue', label='±2σ region')
ax.scatter(X_train, y_train, c='red', s=100, zorder=5,
           edgecolors='black', label='Observations')
ax.set_xlabel('x', fontsize=12)
ax.set_ylabel('f(x)', fontsize=12)
ax.set_title('Posterior: GP After Observing Data', fontsize=14)
ax.set_ylim([-3, 3])
ax.legend(loc='upper right')
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("Prior: diffuse uncertainty everywhere")
print("Posterior: uncertainty collapses near observations, remains high elsewhere")
```

Key observations from the visualization: prior samples wander freely and the uncertainty band is equally wide everywhere, while posterior samples are pinned near the observations, where the band collapses almost to the noise level; away from the data, the band widens back toward the prior.
We've now established the function space view—the conceptual foundation for understanding Gaussian Processes. Let's consolidate the key insights:

GPs place a prior directly over functions; parameters, if they appear at all, are secondary.

The Kolmogorov Extension Theorem makes 'a distribution over functions' rigorous: consistent finite-dimensional Gaussians determine the whole process.

All computation is finite-dimensional Gaussian algebra: marginalization and conditioning at the observed and queried points.

The kernel encodes beliefs about function properties (smoothness, periodicity, variability); the mean function sets the central trend.

Marginalization consistency and closed-form conditioning give coherent, principled uncertainty about the function itself.
The function space view is one side of a fundamental duality in GP theory. In the next page, we'll explore the weight space view, which shows how GPs can equivalently be understood as Bayesian linear regression with an infinite number of basis functions. Together, these perspectives provide complete intuition for how and why GPs work.
You now understand the function space perspective on Gaussian Processes—viewing models as distributions over functions rather than parameters. This conceptual shift is foundational: it explains GP uncertainty quantification, marginalization consistency, and the non-parametric flexibility that makes GPs uniquely powerful. Proceed to understand the complementary weight space view.