On the previous page, we approached Gaussian Processes from the function space view—defining distributions directly over functions in an infinite-dimensional space. This perspective is conceptually elegant but may feel abstract.
Now we'll develop the weight space view, which shows how GPs arise naturally from familiar Bayesian linear regression. This perspective provides computational intuition and reveals why GPs are often called 'nonparametric'—they can be understood as parametric models with infinitely many parameters.
The remarkable insight is that these two views are mathematically equivalent. They're different lenses on the same mathematical object, each illuminating different aspects of Gaussian Processes.
By the end of this page, you will understand how to derive Gaussian Processes from Bayesian linear regression, why the kernel function emerges as an inner product in feature space, and how the 'kernel trick' enables infinite-dimensional computations. You'll master the duality between function space and weight space perspectives—essential for both theoretical understanding and practical implementation.
To understand the weight space view, we begin with Bayesian linear regression—the foundation upon which we'll build.
The Model: Given input $\mathbf{x} \in \mathbb{R}^d$ and feature mapping $\boldsymbol{\phi}: \mathbb{R}^d \to \mathbb{R}^D$, we model the relationship as:
$$f(\mathbf{x}) = \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}) = \sum_{j=1}^{D} w_j \phi_j(\mathbf{x})$$
with observations corrupted by Gaussian noise:
$$y = f(\mathbf{x}) + \epsilon, \quad \epsilon \sim \mathcal{N}(0, \sigma_n^2)$$
The Bayesian Approach: Rather than finding a single optimal $\mathbf{w}$, we maintain uncertainty by placing a prior over weights:
$$\mathbf{w} \sim \mathcal{N}(\mathbf{0}, \mathbf{S}_p)$$
where $\mathbf{S}_p$ is the prior covariance matrix. A common choice is $\mathbf{S}_p = \sigma_w^2 \mathbf{I}_D$.
The Posterior Over Weights: Given training data $\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$, Bayes' theorem gives us the posterior:
$$p(\mathbf{w}|\mathbf{y}, \mathbf{X}) \propto p(\mathbf{y}|\mathbf{X}, \mathbf{w}) p(\mathbf{w})$$
Since both prior and likelihood are Gaussian, the posterior is also Gaussian:
$$\mathbf{w}|\mathcal{D} \sim \mathcal{N}(\boldsymbol{\mu}_w, \boldsymbol{\Sigma}_w)$$
where: $$\boldsymbol{\Sigma}_w = (\mathbf{S}_p^{-1} + \sigma_n^{-2} \boldsymbol{\Phi}^\top \boldsymbol{\Phi})^{-1}$$ $$\boldsymbol{\mu}_w = \sigma_n^{-2} \boldsymbol{\Sigma}_w \boldsymbol{\Phi}^\top \mathbf{y}$$
Here $\boldsymbol{\Phi}$ is the $n \times D$ design matrix with rows $\boldsymbol{\phi}(\mathbf{x}_i)^\top$.
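As a quick sanity check on where these expressions come from, one can complete the square in the log-posterior (standard Gaussian algebra; the constant absorbs all terms that do not involve $\mathbf{w}$):

$$\log p(\mathbf{w}|\mathbf{y}, \mathbf{X}) = -\frac{1}{2\sigma_n^2}\|\mathbf{y} - \boldsymbol{\Phi}\mathbf{w}\|^2 - \frac{1}{2}\mathbf{w}^\top \mathbf{S}_p^{-1}\mathbf{w} + \text{const} = -\frac{1}{2}\mathbf{w}^\top \left(\mathbf{S}_p^{-1} + \sigma_n^{-2}\boldsymbol{\Phi}^\top \boldsymbol{\Phi}\right)\mathbf{w} + \sigma_n^{-2}\mathbf{y}^\top \boldsymbol{\Phi}\mathbf{w} + \text{const}$$

Matching the quadratic term against the Gaussian form $-\frac{1}{2}(\mathbf{w} - \boldsymbol{\mu}_w)^\top \boldsymbol{\Sigma}_w^{-1} (\mathbf{w} - \boldsymbol{\mu}_w)$ gives $\boldsymbol{\Sigma}_w^{-1} = \mathbf{S}_p^{-1} + \sigma_n^{-2}\boldsymbol{\Phi}^\top \boldsymbol{\Phi}$, and matching the linear term gives $\boldsymbol{\mu}_w = \sigma_n^{-2}\boldsymbol{\Sigma}_w \boldsymbol{\Phi}^\top \mathbf{y}$, exactly the formulas above.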
The posterior computation involves inverting a $D \times D$ matrix, where $D$ is the number of features. When $D$ is large (approaching infinity), this becomes intractable. The weight space view reveals how GPs elegantly circumvent this by working with $n \times n$ matrices instead, where $n$ is the number of data points. This is the essence of the 'kernel trick.'
Here's the key insight: a prior over weights induces a prior over functions. Let's trace this connection carefully.
The function at any point $\mathbf{x}$ is a linear combination of weights:
$$f(\mathbf{x}) = \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x})$$
Since $\mathbf{w} \sim \mathcal{N}(\mathbf{0}, \mathbf{S}_p)$ is Gaussian, and $f(\mathbf{x})$ is a linear function of $\mathbf{w}$, the function value $f(\mathbf{x})$ is also Gaussian:
$$f(\mathbf{x}) \sim \mathcal{N}(0, \boldsymbol{\phi}(\mathbf{x})^\top \mathbf{S}_p \boldsymbol{\phi}(\mathbf{x}))$$
Now consider any two input points $\mathbf{x}$ and $\mathbf{x}'$. Since both function values are linear in $\mathbf{w}$:
$$\text{Cov}[f(\mathbf{x}), f(\mathbf{x}')] = \mathbb{E}[f(\mathbf{x}) f(\mathbf{x}')] - \mathbb{E}[f(\mathbf{x})]\mathbb{E}[f(\mathbf{x}')]$$
With a zero-mean prior, and using $\mathbb{E}[\mathbf{w}\mathbf{w}^\top] = \mathbf{S}_p$: $$\text{Cov}[f(\mathbf{x}), f(\mathbf{x}')] = \mathbb{E}[\mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x})\, \boldsymbol{\phi}(\mathbf{x}')^\top \mathbf{w}] = \boldsymbol{\phi}(\mathbf{x})^\top \mathbb{E}[\mathbf{w}\mathbf{w}^\top]\, \boldsymbol{\phi}(\mathbf{x}') = \boldsymbol{\phi}(\mathbf{x})^\top \mathbf{S}_p \boldsymbol{\phi}(\mathbf{x}')$$
The Critical Observation: This covariance between function values defines a kernel function:
$$k(\mathbf{x}, \mathbf{x}') \triangleq \boldsymbol{\phi}(\mathbf{x})^\top \mathbf{S}_p \boldsymbol{\phi}(\mathbf{x}')$$
For the common case $\mathbf{S}_p = \sigma_w^2 \mathbf{I}$, this simplifies to:
$$k(\mathbf{x}, \mathbf{x}') = \sigma_w^2 \boldsymbol{\phi}(\mathbf{x})^\top \boldsymbol{\phi}(\mathbf{x}')$$
This is an inner product in feature space! The kernel measures similarity between inputs via their feature representations.
For any finite collection of inputs $\{\mathbf{x}_1, \ldots, \mathbf{x}_n\}$, the joint distribution over function values is:
$$\begin{bmatrix} f(\mathbf{x}_1) \\ \vdots \\ f(\mathbf{x}_n) \end{bmatrix} \sim \mathcal{N}(\mathbf{0}, \mathbf{K})$$
where $K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)$.
This is exactly the definition of a Gaussian Process! The weight space prior induces a GP prior on functions.
We've just proved that Bayesian linear regression with feature mapping φ and Gaussian prior on weights is equivalent to a Gaussian Process with kernel k(x, x') = φ(x)ᵀ Sₚ φ(x'). This is the heart of the weight space view: GPs are Bayesian linear regression in disguise—specifically, with (potentially infinitely many) basis functions.
The weight-function duality reveals something remarkable: we never need to compute the features explicitly. All that matters is the kernel, which is the inner product in feature space.
The Kernel Trick: If we can compute $k(\mathbf{x}, \mathbf{x}') = \boldsymbol{\phi}(\mathbf{x})^\top \boldsymbol{\phi}(\mathbf{x}')$ directly without forming the (potentially infinite-dimensional) features, we can work with arbitrarily complex feature spaces at finite computational cost.
Example: Polynomial Kernel
Consider inputs $x, x' \in \mathbb{R}$. Define:
$$k(x, x') = (1 + xx')^2$$
Expanding: $$k(x, x') = 1 + 2xx' + x^2 x'^2 = \begin{bmatrix} 1 & \sqrt{2}x & x^2 \end{bmatrix} \begin{bmatrix} 1 \\ \sqrt{2}x' \\ x'^2 \end{bmatrix}$$
So this kernel corresponds to features $\boldsymbol{\phi}(x) = [1, \sqrt{2}x, x^2]^\top$—polynomial features up to degree 2.
For $(1 + xx')^d$, we get features up to degree $d$. The computation is still just a single evaluation of $(1 + xx')^d$, regardless of how many features this represents!
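A quick numerical check of this correspondence (a minimal sketch; `phi` and `poly_kernel` are illustrative helper names, not part of any library):

```python
import numpy as np

def phi(x):
    """Explicit degree-2 polynomial features: [1, sqrt(2)*x, x^2]."""
    return np.array([1.0, np.sqrt(2.0) * x, x**2])

def poly_kernel(x, x_prime):
    """Polynomial kernel (1 + x*x')^2 evaluated directly."""
    return (1.0 + x * x_prime)**2

x, x_prime = 0.7, -1.3
print(phi(x) @ phi(x_prime))    # inner product in feature space
print(poly_kernel(x, x_prime))  # single kernel evaluation -> same number
```

Both lines print the same value: the single kernel evaluation implicitly performs the feature-space inner product.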
Example: Gaussian (RBF) Kernel
The squared exponential (RBF) kernel is defined as:
$$k(x, x') = \exp\left(-\frac{(x - x')^2}{2\ell^2}\right)$$
This kernel corresponds to an infinite-dimensional feature space! Via Taylor expansion of the exponential:
$$\exp\left(-\frac{(x - x')^2}{2\ell^2}\right) = \exp\left(-\frac{x^2 + x'^2}{2\ell^2}\right) \exp\left(\frac{xx'}{\ell^2}\right)$$
$$= \exp\left(-\frac{x^2 + x'^2}{2\ell^2}\right) \sum_{n=0}^{\infty} \frac{(xx')^n}{\ell^{2n} n!}$$
The infinite sum reveals infinitely many polynomial features, each weighted by the exponential decay. The kernel computes this infinite inner product in closed form!
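To make the "infinite inner product in closed form" claim concrete, here is a small sketch that truncates the series after $N$ terms and watches it converge to the exact RBF value (the inputs and truncation points are arbitrary choices for illustration):

```python
import numpy as np
from math import factorial

def rbf_exact(x, x_prime, ell=1.0):
    """Exact squared-exponential kernel."""
    return np.exp(-(x - x_prime)**2 / (2 * ell**2))

def rbf_truncated(x, x_prime, ell=1.0, N=5):
    """Truncate the Taylor series of exp(x*x'/ell^2) after N terms."""
    prefactor = np.exp(-(x**2 + x_prime**2) / (2 * ell**2))
    series = sum((x * x_prime)**n / (ell**(2 * n) * factorial(n)) for n in range(N))
    return prefactor * series

x, x_prime = 0.8, 1.5
for N in [1, 2, 5, 10, 20]:
    print(f"N={N:2d}: {rbf_truncated(x, x_prime, N=N):.6f}  "
          f"(exact: {rbf_exact(x, x_prime):.6f})")
```

Each added term corresponds to one more polynomial feature; the exact kernel evaluation sums all of them at once.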
The kernel trick transforms what would require infinite-dimensional linear algebra into finite matrix operations. For n training points, we work with an n×n kernel matrix K, regardless of the (potentially infinite) dimensionality of the feature space. This is why GPs can represent arbitrarily complex functions while remaining computationally tractable.
| Kernel | Formula | Feature Space Dimension | Interpretation |
|---|---|---|---|
| Linear | $\mathbf{x}^\top \mathbf{x}'$ | $d$ (input dim) | Standard linear regression |
| Polynomial | $(c + \mathbf{x}^\top \mathbf{x}')^p$ | $\binom{d+p}{p}$ | Polynomials up to degree $p$ |
| RBF (Gaussian) | $\exp\left(-\frac{\lVert\mathbf{x}-\mathbf{x}'\rVert^2}{2\ell^2}\right)$ | $\infty$ | All polynomial degrees |
| Laplacian | $\exp\left(-\frac{\lVert\mathbf{x}-\mathbf{x}'\rVert}{\ell}\right)$ | $\infty$ | Less smooth than RBF |
| Matérn | Various forms | $\infty$ | Controlled smoothness |
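To see the "Interpretation" column in action, the sketch below draws one prior sample from a GP with each of three stationary kernels from the table and compares how rough the samples are (the Matérn entry uses the common $\nu = 3/2$ form as a concrete instance; grid, length scales, and the roughness proxy are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 5, 200)

def rbf(x1, x2, ell=1.0):
    return np.exp(-np.subtract.outer(x1, x2)**2 / (2 * ell**2))

def laplacian(x1, x2, ell=1.0):
    return np.exp(-np.abs(np.subtract.outer(x1, x2)) / ell)

def matern32(x1, x2, ell=1.0):
    d = np.abs(np.subtract.outer(x1, x2))
    return (1 + np.sqrt(3) * d / ell) * np.exp(-np.sqrt(3) * d / ell)

for name, k in [("RBF", rbf), ("Matern-3/2", matern32), ("Laplacian", laplacian)]:
    K = k(x, x) + 1e-6 * np.eye(len(x))              # jitter for numerical stability
    f = np.linalg.cholesky(K) @ rng.standard_normal(len(x))   # sample f ~ N(0, K)
    # Rough smoothness proxy: mean absolute increment between neighbouring grid points
    print(f"{name:12s} mean |f(x+dx) - f(x)|: {np.mean(np.abs(np.diff(f))):.3f}")
```

The RBF sample has the smallest increments (smoothest), the Laplacian the largest, matching the table's smoothness ordering.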
Let's derive GP predictions from the weight space perspective. This derivation shows how the kernel naturally emerges.
Setup: We have training data $(\mathbf{X}, \mathbf{y})$ and want to predict at test points $\mathbf{X}_*$.
Weight space posterior (from Section 1): $$\mathbf{w}|\mathcal{D} \sim \mathcal{N}(\boldsymbol{\mu}_w, \boldsymbol{\Sigma}_w)$$
Predictive distribution for test points: $$f_* = \boldsymbol{\Phi}_* \mathbf{w}$$
where $\boldsymbol{\Phi}_*$ is the feature matrix for test points.
Since $\mathbf{w}$ is Gaussian and $f_*$ is linear in $\mathbf{w}$:
$$f_*|\mathcal{D} \sim \mathcal{N}(\boldsymbol{\Phi}_* \boldsymbol{\mu}_w,\; \boldsymbol{\Phi}_* \boldsymbol{\Sigma}_w \boldsymbol{\Phi}_*^\top)$$
The Kernelized Form: Now we'll show this equals the standard GP prediction. Substituting the expressions for $\boldsymbol{\mu}_w$ and $\boldsymbol{\Sigma}_w$ and applying matrix identities (specifically, the Woodbury identity):
Predictive Mean: $$\mathbb{E}[f_*] = \mathbf{K}_{*f} (\mathbf{K}_{ff} + \sigma_n^2 \mathbf{I})^{-1} \mathbf{y}$$
Predictive Covariance: $$\text{Cov}[f_*] = \mathbf{K}_{**} - \mathbf{K}_{*f} (\mathbf{K}_{ff} + \sigma_n^2 \mathbf{I})^{-1} \mathbf{K}_{f*}$$
where:
- $\mathbf{K}_{ff}$ is the $n \times n$ matrix of kernel values between training inputs, $[\mathbf{K}_{ff}]_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)$
- $\mathbf{K}_{*f}$ is the matrix of kernel values between test and training inputs (and $\mathbf{K}_{f*} = \mathbf{K}_{*f}^\top$)
- $\mathbf{K}_{**}$ is the matrix of kernel values between test inputs
Crucially: These formulas only involve kernel evaluations, not explicit features! We've transformed from $D \times D$ operations (potentially infinite) to $n \times n$ operations (number of training points).
```python
import numpy as np
from scipy.linalg import solve
from scipy.special import comb

# Setup: Polynomial regression as both weight-space and kernel-space
np.random.seed(42)

# Training data
X_train = np.array([0.5, 1.0, 2.0, 3.0, 4.0])
y_train = np.array([1.2, 0.8, 1.5, 3.0, 2.8])
n = len(X_train)
noise_var = 0.1

# Test points
X_test = np.linspace(0, 5, 100)

# === WEIGHT SPACE VIEW ===
# Binomial-weighted polynomial features, chosen so that
# phi(x) . phi(x') = (1 + x*x')^degree matches the kernel used below
def polynomial_features(x, degree=3):
    """Features phi_i(x) = sqrt(C(degree, i)) * x^i for i = 0..degree."""
    return np.column_stack([np.sqrt(comb(degree, i)) * x**i
                            for i in range(degree + 1)])

degree = 3
Phi_train = polynomial_features(X_train, degree)   # n x (degree+1)
Phi_test = polynomial_features(X_test, degree)     # n_test x (degree+1)

# Prior covariance on weights (identity scaled)
prior_var = 1.0
S_p = prior_var * np.eye(degree + 1)
S_p_inv = np.eye(degree + 1) / prior_var

# Posterior on weights
A = S_p_inv + Phi_train.T @ Phi_train / noise_var
Sigma_w = np.linalg.inv(A)
mu_w = Sigma_w @ Phi_train.T @ y_train / noise_var

# Predictive distribution (weight space)
mean_ws = Phi_test @ mu_w
cov_ws = Phi_test @ Sigma_w @ Phi_test.T

print("=== WEIGHT SPACE VIEW ===")
print(f"Working in {degree + 1}-dimensional feature space")
print(f"Posterior weight mean: {mu_w.round(3)}")

# === KERNEL SPACE VIEW ===
# Polynomial kernel: k(x, x') = (1 + x*x')^degree
def polynomial_kernel(x1, x2, degree=3, c=1.0):
    """Polynomial kernel matrix."""
    return (c + np.outer(x1, x2))**degree

K_train = polynomial_kernel(X_train, X_train, degree) * prior_var
K_test_train = polynomial_kernel(X_test, X_train, degree) * prior_var
K_test = polynomial_kernel(X_test, X_test, degree) * prior_var

# GP prediction (kernel space)
L = np.linalg.cholesky(K_train + noise_var * np.eye(n))
alpha = solve(L.T, solve(L, y_train))

mean_ks = K_test_train @ alpha
v = solve(L, K_test_train.T)
cov_ks = K_test - v.T @ v

print("\n=== KERNEL SPACE VIEW ===")
print(f"Working with {n}x{n} kernel matrix (regardless of feature dim)")

# Verify equivalence
print("\n=== VERIFICATION ===")
print(f"Mean difference (max): {np.max(np.abs(mean_ws - mean_ks)):.2e}")
print(f"Covariance difference (max): {np.max(np.abs(cov_ws - cov_ks)):.2e}")
print("Both views give IDENTICAL predictions!")
```

The code demonstrates that weight space and kernel space computations give identical predictions—but kernel space requires only n×n matrix operations regardless of feature dimension. For infinite-dimensional kernels like RBF, weight space computation is impossible, but kernel space computation remains tractable!
Not every function of two arguments can serve as a valid kernel. For the weight space interpretation to hold, the kernel must correspond to some inner product in a feature space. This is where Mercer's theorem provides the answer.
Mercer's Theorem (informal): A symmetric function $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is a valid kernel if and only if, for any finite set of points $\{\mathbf{x}_1, \ldots, \mathbf{x}_n\}$, the kernel matrix $\mathbf{K}$ with entries $K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)$ is positive semi-definite.
Positive semi-definite means: For all vectors $\mathbf{c} \in \mathbb{R}^n$:
$$\mathbf{c}^\top \mathbf{K} \mathbf{c} = \sum_{i,j} c_i c_j k(\mathbf{x}_i, \mathbf{x}_j) \geq 0$$
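A direct way to test the condition numerically is to build $\mathbf{K}$ on a set of points and inspect its eigenvalues. The sketch below contrasts the RBF kernel with a function that is symmetric but not PSD (the negative squared distance, chosen purely as an illustrative counterexample):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, size=20)

def rbf(x1, x2, ell=1.0):
    return np.exp(-np.subtract.outer(x1, x2)**2 / (2 * ell**2))

def not_a_kernel(x1, x2):
    """Symmetric, but NOT positive semi-definite: negative squared distance."""
    return -np.subtract.outer(x1, x2)**2

for name, k in [("RBF", rbf), ("-(x - x')^2", not_a_kernel)]:
    eigvals = np.linalg.eigvalsh(k(x, x))
    verdict = "valid (PSD up to round-off)" if eigvals.min() > -1e-10 else "NOT a valid kernel"
    print(f"{name:12s} min eigenvalue: {eigvals.min():+.3f}  {verdict}")
```

The RBF kernel matrix has no meaningfully negative eigenvalues; the counterexample does, so it cannot arise as an inner product in any feature space.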
Why this matters: PSD kernels guarantee that:
- every finite matrix $\mathbf{K}$ built from the kernel is a valid Gaussian covariance (no negative variances), so the GP prior is well defined and can be sampled from;
- some feature map $\boldsymbol{\phi}$ exists with $k(\mathbf{x}, \mathbf{x}') = \boldsymbol{\phi}(\mathbf{x})^\top \boldsymbol{\phi}(\mathbf{x}')$, so the weight space interpretation always holds;
- the predictive formulas above remain well posed, since $\mathbf{K}_{ff} + \sigma_n^2 \mathbf{I}$ is positive definite and hence invertible.
Kernel algebra lets you construct sophisticated kernels from simple building blocks. For example, combining a periodic kernel with an RBF decay creates functions that are 'locally periodic but globally decaying.' The weight space view guarantees that any kernel built via these operations corresponds to some (possibly complex) feature space.
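As a small sketch of that example, a locally periodic kernel can be built as the product of a periodic kernel and an RBF envelope (the period and length scales below are arbitrary illustrative values):

```python
import numpy as np

def rbf(x1, x2, ell=2.0):
    return np.exp(-np.subtract.outer(x1, x2)**2 / (2 * ell**2))

def periodic(x1, x2, period=1.0, ell=0.5):
    d = np.subtract.outer(x1, x2)
    return np.exp(-2 * np.sin(np.pi * d / period)**2 / ell**2)

def locally_periodic(x1, x2):
    """Product of two valid kernels is a valid kernel: periodic pattern in an RBF envelope."""
    return periodic(x1, x2) * rbf(x1, x2)

x = np.linspace(0, 6, 150)
K = locally_periodic(x, x)
print("min eigenvalue:", np.linalg.eigvalsh(K).min())   # ~>= 0: still PSD, hence still a kernel

# Correlation one period apart stays high, but decays after many periods
print("k(0, 1) =", locally_periodic(np.array([0.0]), np.array([1.0]))[0, 0])
print("k(0, 5) =", locally_periodic(np.array([0.0]), np.array([5.0]))[0, 0])
```

The eigenvalue check confirms the product is still PSD, and the two kernel evaluations show the "locally periodic, globally decaying" behaviour described above.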
The RKHS Connection: Every valid kernel defines a Reproducing Kernel Hilbert Space (RKHS)—a special function space where:
- the functions $k(\cdot, \mathbf{x})$ are the basic building blocks, and the space consists of (limits of) their linear combinations;
- evaluation is an inner product: $f(\mathbf{x}) = \langle f, k(\cdot, \mathbf{x}) \rangle$ (the reproducing property);
- the RKHS norm measures how complex a function is as seen by the kernel.
This connects GPs to regularization theory: the GP posterior mean is the function in the RKHS that minimizes a regularized loss, with the kernel controlling the regularization!
The weight space view becomes fully explicit through the spectral decomposition of the kernel. This reveals the exact basis functions and their associated 'weights.'
Mercer's Eigenvalue Theorem: For a positive semi-definite kernel $k$ on a compact domain, there exists a sequence of eigenvalues $\lambda_1 \geq \lambda_2 \geq \cdots \geq 0$ and orthonormal eigenfunctions $\{\phi_i\}$ such that:
$$k(\mathbf{x}, \mathbf{x}') = \sum_{i=1}^{\infty} \lambda_i \phi_i(\mathbf{x}) \phi_i(\mathbf{x}')$$
The GP as Weighted Basis Expansion: This decomposition shows that a GP with kernel $k$ can be written as:
$$f(\mathbf{x}) = \sum_{i=1}^{\infty} \sqrt{\lambda_i} w_i \phi_i(\mathbf{x}), \quad w_i \sim \mathcal{N}(0, 1) \text{ i.i.d.}$$
where $\{\phi_i\}$ are the eigenfunctions and the eigenvalues $\{\lambda_i\}$ weight their contributions.
Intuition from Eigenvalue Decay:
For the RBF kernel, eigenvalues decay exponentially, meaning:
- only a small number of smooth, slowly varying eigenfunctions carry almost all of the prior variance;
- sample functions are extremely smooth (infinitely differentiable);
- the effective number of degrees of freedom is far smaller than "infinite-dimensional" might suggest.
For polynomial kernels, only finitely many eigenvalues are non-zero—corresponding to the finite-dimensional polynomial feature space.
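This contrast can be seen empirically by eigendecomposing kernel matrices on a dense grid, which serves as a finite-sample proxy for Mercer's eigenvalues (grid, hyperparameters, and the significance threshold below are illustrative choices):

```python
import numpy as np

x = np.linspace(0, 1, 200)

K_rbf = np.exp(-np.subtract.outer(x, x)**2 / (2 * 0.2**2))   # RBF, length scale 0.2
K_poly = (1 + np.outer(x, x))**3                              # polynomial kernel, degree 3

for name, K in [("RBF", K_rbf), ("Polynomial (deg 3)", K_poly)]:
    eigvals = np.sort(np.linalg.eigvalsh(K))[::-1]
    # Count eigenvalues that are non-negligible relative to the largest one
    significant = np.sum(eigvals > 1e-10 * eigvals[0])
    print(f"{name:20s} top 5 eigenvalues: {np.round(eigvals[:5], 2)}, "
          f"significant: {significant} of {len(x)}")
```

The degree-3 polynomial kernel matrix has exactly four significant eigenvalues (its four-dimensional feature space), while the RBF matrix has many more, with magnitudes that fall off rapidly.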
| Kernel | Eigenvalue Decay | Effective Dimensionality | Sample Function Properties |
|---|---|---|---|
| Linear | Finitely many non-zero | Input dimension $d$ | Linear functions |
| Polynomial (degree $p$) | Finitely many non-zero | $\binom{d+p}{p}$ | Polynomials up to degree $p$ |
| RBF (Gaussian) | Exponential: $\lambda_i \sim e^{-ci}$ | Infinite (rapidly decaying) | Infinitely smooth |
| Matérn-ν | Polynomial: $\lambda_i \sim i^{-(2\nu+d)/d}$ | Infinite (slowly decaying) | $\nu$-times differentiable |
| Periodic | Depends on smoothness | Infinite | Periodic with specified period |
There's a deep connection: smoother kernels have faster eigenvalue decay. The RBF kernel's infinitely differentiable sample functions correspond to exponential eigenvalue decay. Matérn kernels with finite smoothness have polynomial decay. This connection—via Sobolev spaces—links GP theory to classical functional analysis.
The weight space view enables a powerful approximation technique: Random Fourier Features (RFF). This method provides finite-dimensional approximations to kernels, enabling scalable GP inference.
Bochner's Theorem: A shift-invariant kernel $k(\mathbf{x}, \mathbf{x}') = k(\mathbf{x} - \mathbf{x}')$ is positive semi-definite if and only if it is the Fourier transform of a non-negative measure:
$$k(\mathbf{x} - \mathbf{x}') = \int p(\boldsymbol{\omega}) e^{i \boldsymbol{\omega}^\top (\mathbf{x} - \mathbf{x}')} d\boldsymbol{\omega}$$
where $p(\boldsymbol{\omega})$ is a probability distribution called the spectral density.
For the RBF kernel with length scale $\ell$:
$$p(\boldsymbol{\omega}) = \mathcal{N}(\boldsymbol{\omega} | \mathbf{0}, \ell^{-2} \mathbf{I})$$
The Random Fourier Features Approximation:
1. Sample $m$ frequencies $\boldsymbol{\omega}_1, \ldots, \boldsymbol{\omega}_m \sim p(\boldsymbol{\omega})$ and offsets $b_1, \ldots, b_m \sim \text{Uniform}[0, 2\pi]$.
2. Define the random features $\phi_i(\mathbf{x}) = \sqrt{2/m}\, \cos(\boldsymbol{\omega}_i^\top \mathbf{x} + b_i)$.
3. Approximate the kernel by the finite inner product $k(\mathbf{x}, \mathbf{x}') \approx \boldsymbol{\phi}(\mathbf{x})^\top \boldsymbol{\phi}(\mathbf{x}')$, which converges to the exact kernel as $m \to \infty$.
Why this is powerful: With $m$ random features, we've reduced an $n \times n$ kernel matrix computation to an $n \times m$ feature matrix. For $m \ll n$, this dramatically speeds up GP training from $O(n^3)$ to $O(nm^2)$.
```python
import numpy as np
import matplotlib.pyplot as plt

# Random Fourier Features approximation to RBF kernel
np.random.seed(42)

def rbf_kernel(x1, x2, length_scale=1.0):
    """Exact RBF kernel"""
    return np.exp(-0.5 * np.subtract.outer(x1, x2)**2 / length_scale**2)

def random_fourier_features(x, omega, b):
    """Compute RFF features: sqrt(2/m) * cos(omega.T @ x + b)"""
    m = len(b)
    # x: (n,), omega: (m,), b: (m,)
    return np.sqrt(2.0 / m) * np.cos(np.outer(x, omega) + b)

# Setup
length_scale = 1.0
x = np.linspace(0, 5, 50)

# Exact kernel matrix
K_exact = rbf_kernel(x, x, length_scale)

# Compare RFF approximations with different m
fig, axes = plt.subplots(2, 3, figsize=(14, 8))

for idx, m in enumerate([5, 20, 100, 500, 2000, 10000]):
    ax = axes[idx // 3, idx % 3]

    # Sample from spectral density: N(0, 1/ℓ²)
    omega = np.random.randn(m) / length_scale
    b = np.random.uniform(0, 2 * np.pi, m)

    # Compute RFF features and approximate kernel
    Phi = random_fourier_features(x, omega, b)
    K_approx = Phi @ Phi.T

    # Compute error
    error = np.max(np.abs(K_exact - K_approx))

    # Plot
    im = ax.imshow(K_approx, cmap='viridis')
    ax.set_title(f'm = {m}\nMax Error: {error:.4f}')
    ax.set_xlabel('x index')
    ax.set_ylabel('x index')

plt.suptitle('Random Fourier Features Approximation to RBF Kernel', fontsize=14)
plt.tight_layout()
plt.show()

print("As m increases, RFF approximation converges to the exact kernel.")
print("This enables weight-space GP computation for infinite-dimensional kernels!")
```

Random Fourier Features bridge the conceptual elegance of the weight space view with scalable computation. With m features, you can: (1) train GP-like models with O(nm²) complexity instead of O(n³), (2) use standard linear regression solvers, (3) apply stochastic gradient methods for massive datasets. This has enabled GP-scale uncertainty quantification in deep learning and large-scale optimization.
We've now developed both perspectives on Gaussian Processes. Let's consolidate their relationship and understand when each view is most useful.
The Mathematical Equivalence:
| Function Space View | Weight Space View |
|---|---|
| Distribution over functions $p(f)$ | Distribution over weights $p(\mathbf{w})$ inducing functions |
| Specified by mean $m(\mathbf{x})$ and kernel $k(\mathbf{x}, \mathbf{x}')$ | Specified by prior $p(\mathbf{w})$ and features $\boldsymbol{\phi}(\mathbf{x})$ |
| Kernel $k(\mathbf{x}, \mathbf{x}')$ | Inner product $\boldsymbol{\phi}(\mathbf{x})^\top \mathbf{S}_p \boldsymbol{\phi}(\mathbf{x}')$ |
| Works with infinite-dimensional function spaces | May require infinitely many basis functions |
| Computation: $O(n^3)$ via kernel matrices | Computation: $O(D^3)$ via weight posteriors |
Key Insight: The views are equivalent when $k(\mathbf{x}, \mathbf{x}') = \boldsymbol{\phi}(\mathbf{x})^\top \mathbf{S}_p \boldsymbol{\phi}(\mathbf{x}')$. Every kernel corresponds to some feature space (and vice versa), but the kernel representation is often more compact.
Expert GP practitioners fluently switch between views. Use function space for model building and interpretation—it's easier to reason about what kinds of functions your kernel implies. Use weight space for computation and approximation—especially when scaling to large datasets. The duality is a feature, not a bug.
A beautiful connection emerges from the weight space view: neural networks become Gaussian Processes in the infinite-width limit. This theoretical result bridges deep learning and kernel methods.
Neal's Observation (1996): Consider a single hidden-layer neural network:
$$f(\mathbf{x}) = \sum_{j=1}^{H} v_j \sigma(\mathbf{w}_j^\top \mathbf{x})$$
where $\mathbf{w}_j, v_j$ are weights initialized i.i.d. from distributions with zero mean, and $\sigma$ is a nonlinear activation.
As $H \to \infty$: By the Central Limit Theorem, the sum of many independent random terms converges to a Gaussian. Thus:
$$f(\mathbf{x}) \xrightarrow{d} \mathcal{GP}(0, k_{\text{NN}})$$
where $k_{\text{NN}}$ is a kernel determined by the activation function and weight initialization.
The Neural Network Kernel: For ReLU activations with standard initialization:
$$k_{\text{NN}}(\mathbf{x}, \mathbf{x}') = \frac{\lVert\mathbf{x}\rVert\, \lVert\mathbf{x}'\rVert}{2\pi} \left(\sin \theta + (\pi - \theta) \cos \theta\right)$$
where $\theta = \cos^{-1}\left(\frac{\mathbf{x}^\top \mathbf{x}'}{\lVert\mathbf{x}\rVert\, \lVert\mathbf{x}'\rVert}\right)$ is the angle between inputs.
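A Monte Carlo sketch of Neal's argument for this ReLU case: average over many random hidden units and compare with the closed-form kernel above (this assumes hidden weights $\mathbf{w}_j \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ and output weights with variance $1/H$, a standard scaling adopted here for illustration):

```python
import numpy as np

def relu_nn_kernel(x, x_prime):
    """Closed-form kernel above, for one ReLU layer with N(0, I) input weights."""
    nx, nxp = np.linalg.norm(x), np.linalg.norm(x_prime)
    cos_theta = np.clip(x @ x_prime / (nx * nxp), -1.0, 1.0)
    theta = np.arccos(cos_theta)
    return nx * nxp / (2 * np.pi) * (np.sin(theta) + (np.pi - theta) * np.cos(theta))

rng = np.random.default_rng(0)
x = np.array([1.0, 0.5])
x_prime = np.array([-0.3, 2.0])

H = 200_000                          # hidden width, large to approximate the limit
W = rng.standard_normal((H, 2))      # hidden weights w_j ~ N(0, I)

# With output weights of variance 1/H, Cov[f(x), f(x')] = mean_j relu(w_j.x) * relu(w_j.x')
h_x = np.maximum(W @ x, 0.0)
h_xp = np.maximum(W @ x_prime, 0.0)
empirical = np.mean(h_x * h_xp)

print("closed-form kernel :", relu_nn_kernel(x, x_prime))
print("Monte Carlo average:", empirical)   # the two numbers should agree closely for large H
```

As the width grows, the empirical covariance of the random network converges to the closed-form neural network kernel, which is exactly the GP limit described above.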
For Deep Networks (Jacot et al., 2018): The connection extends to deep networks via the Neural Tangent Kernel (NTK). Infinite-width deep networks trained by gradient descent behave like kernel regression with a specific kernel determined by the architecture.
Implications:
- Exact Bayesian inference for an infinite-width network reduces to GP regression with $k_{\text{NN}}$—no backpropagation required.
- Kernel methods and neural networks are not opposing paradigms but two ends of the same family of models.
- Finite-width networks can depart from the GP limit by learning features, one way to understand their additional expressive power.
This connection isn't just theoretical curiosity. It explains why neural networks generalize well, provides initialization schemes (to start near the GP regime), and suggests that somewhere between 'exactly GP' and 'finite-width NN' lies a tradeoff between analytical tractability and expressive power. Understanding both paradigms enriches your intuition for both.
We've now developed the complete weight space view of Gaussian Processes. Here are the essential takeaways:
- Bayesian linear regression with feature map $\boldsymbol{\phi}$ and a Gaussian prior on the weights is exactly a GP with kernel $k(\mathbf{x}, \mathbf{x}') = \boldsymbol{\phi}(\mathbf{x})^\top \mathbf{S}_p \boldsymbol{\phi}(\mathbf{x}')$.
- The kernel trick replaces $D \times D$ weight-space computation with $n \times n$ kernel-space computation, even when $D$ is infinite.
- Valid kernels are exactly the symmetric positive semi-definite functions; each defines an implicit feature space and an RKHS.
- Mercer's eigendecomposition exposes the basis functions behind a kernel; the rate of eigenvalue decay encodes smoothness and effective dimensionality.
- Random Fourier Features recover a finite-dimensional weight-space approximation, reducing training cost from $O(n^3)$ to $O(nm^2)$.
- Infinite-width neural networks converge to GPs, connecting kernel methods and deep learning.
With both function space and weight space views established, we're ready to formally define Gaussian Processes and their key mathematical properties. The next page presents the rigorous GP definition, bringing together the intuitions we've developed into a complete mathematical framework.
You now understand the weight space perspective on Gaussian Processes—showing how GPs emerge from Bayesian linear regression with (potentially infinite) basis functions. This view provides computational intuition, connects to neural networks, and enables scalable approximations. Together with the function space view, you have complete dual perspectives on GP fundamentals.