On the previous page, we approached Gaussian Processes from the function space view—defining distributions directly over functions in an infinite-dimensional space. This perspective is conceptually elegant but may feel abstract.
Now we'll develop the weight space view, which shows how GPs arise naturally from familiar Bayesian linear regression. This perspective provides computational intuition and reveals why GPs are often called 'nonparametric'—they can be understood as parametric models with infinitely many parameters.
The remarkable insight is that these two views are mathematically equivalent. They're different lenses on the same mathematical object, each illuminating different aspects of Gaussian Processes.
By the end of this page, you will understand how to derive Gaussian Processes from Bayesian linear regression, why the kernel function emerges as an inner product in feature space, and how the 'kernel trick' enables infinite-dimensional computations. You'll master the duality between function space and weight space perspectives—essential for both theoretical understanding and practical implementation.
To understand the weight space view, we begin with Bayesian linear regression—the foundation upon which we'll build.
The Model: Given input $\mathbf{x} \in \mathbb{R}^d$ and feature mapping $\boldsymbol{\phi}: \mathbb{R}^d \to \mathbb{R}^D$, we model the relationship as:
$$f(\mathbf{x}) = \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}) = \sum_{j=1}^{D} w_j \phi_j(\mathbf{x})$$
with observations corrupted by Gaussian noise:
$$y = f(\mathbf{x}) + \epsilon, \quad \epsilon \sim \mathcal{N}(0, \sigma_n^2)$$
The Bayesian Approach: Rather than finding a single optimal $\mathbf{w}$, we maintain uncertainty by placing a prior over weights:
$$\mathbf{w} \sim \mathcal{N}(\mathbf{0}, \mathbf{S}_p)$$
where $\mathbf{S}_p$ is the prior covariance matrix. A common choice is $\mathbf{S}_p = \sigma_w^2 \mathbf{I}_D$.
The Posterior Over Weights: Given training data $\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$, Bayes' theorem gives us the posterior:
$$p(\mathbf{w}|\mathbf{y}, \mathbf{X}) \propto p(\mathbf{y}|\mathbf{X}, \mathbf{w}) p(\mathbf{w})$$
Since both prior and likelihood are Gaussian, the posterior is also Gaussian:
$$\mathbf{w}|\mathcal{D} \sim \mathcal{N}(\boldsymbol{\mu}_w, \boldsymbol{\Sigma}_w)$$
where: $$\boldsymbol{\Sigma}_w = (\mathbf{S}_p^{-1} + \sigma_n^{-2} \boldsymbol{\Phi}^\top \boldsymbol{\Phi})^{-1}$$ $$\boldsymbol{\mu}_w = \sigma_n^{-2} \boldsymbol{\Sigma}_w \boldsymbol{\Phi}^\top \mathbf{y}$$
Here $\boldsymbol{\Phi}$ is the $n \times D$ design matrix with rows $\boldsymbol{\phi}(\mathbf{x}_i)^\top$.
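As a quick sanity check on where these expressions come from, one can complete the square in the log-posterior (standard Gaussian algebra; the constant absorbs all terms that do not involve $\mathbf{w}$):

$$\log p(\mathbf{w}|\mathbf{y}, \mathbf{X}) = -\frac{1}{2\sigma_n^2}\|\mathbf{y} - \boldsymbol{\Phi}\mathbf{w}\|^2 - \frac{1}{2}\mathbf{w}^\top \mathbf{S}_p^{-1}\mathbf{w} + \text{const} = -\frac{1}{2}\mathbf{w}^\top \left(\mathbf{S}_p^{-1} + \sigma_n^{-2}\boldsymbol{\Phi}^\top \boldsymbol{\Phi}\right)\mathbf{w} + \sigma_n^{-2}\mathbf{y}^\top \boldsymbol{\Phi}\mathbf{w} + \text{const}$$

Matching the quadratic term against the Gaussian form $-\frac{1}{2}(\mathbf{w} - \boldsymbol{\mu}_w)^\top \boldsymbol{\Sigma}_w^{-1} (\mathbf{w} - \boldsymbol{\mu}_w)$ gives $\boldsymbol{\Sigma}_w^{-1} = \mathbf{S}_p^{-1} + \sigma_n^{-2}\boldsymbol{\Phi}^\top \boldsymbol{\Phi}$, and matching the linear term gives $\boldsymbol{\mu}_w = \sigma_n^{-2}\boldsymbol{\Sigma}_w \boldsymbol{\Phi}^\top \mathbf{y}$, exactly the formulas above.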
The posterior computation involves inverting a $D \times D$ matrix, where $D$ is the number of features. When $D$ is large (approaching infinity), this becomes intractable. The weight space view reveals how GPs elegantly circumvent this by working with $n \times n$ matrices instead, where $n$ is the number of data points. This is the essence of the 'kernel trick.'
Here's the key insight: a prior over weights induces a prior over functions. Let's trace this connection carefully.
The function at any point $\mathbf{x}$ is a linear combination of weights:
$$f(\mathbf{x}) = \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x})$$
Since $\mathbf{w} \sim \mathcal{N}(\mathbf{0}, \mathbf{S}_p)$ is Gaussian, and $f(\mathbf{x})$ is a linear function of $\mathbf{w}$, the function value $f(\mathbf{x})$ is also Gaussian:
$$f(\mathbf{x}) \sim \mathcal{N}(0, \boldsymbol{\phi}(\mathbf{x})^\top \mathbf{S}_p \boldsymbol{\phi}(\mathbf{x}))$$
Now consider any two input points $\mathbf{x}$ and $\mathbf{x}'$. Since both function values are linear in $\mathbf{w}$:
$$\text{Cov}[f(\mathbf{x}), f(\mathbf{x}')] = \mathbb{E}[f(\mathbf{x}) f(\mathbf{x}')] - \mathbb{E}[f(\mathbf{x})]\mathbb{E}[f(\mathbf{x}')]$$
With a zero-mean prior, and using $\mathbb{E}[\mathbf{w}\mathbf{w}^\top] = \mathbf{S}_p$: $$\text{Cov}[f(\mathbf{x}), f(\mathbf{x}')] = \mathbb{E}[\mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x})\, \boldsymbol{\phi}(\mathbf{x}')^\top \mathbf{w}] = \boldsymbol{\phi}(\mathbf{x})^\top \mathbb{E}[\mathbf{w}\mathbf{w}^\top]\, \boldsymbol{\phi}(\mathbf{x}') = \boldsymbol{\phi}(\mathbf{x})^\top \mathbf{S}_p \boldsymbol{\phi}(\mathbf{x}')$$
The Critical Observation: This covariance between function values defines a kernel function:
$$k(\mathbf{x}, \mathbf{x}') \triangleq \boldsymbol{\phi}(\mathbf{x})^\top \mathbf{S}_p \boldsymbol{\phi}(\mathbf{x}')$$
For the common case $\mathbf{S}_p = \sigma_w^2 \mathbf{I}$, this simplifies to:
$$k(\mathbf{x}, \mathbf{x}') = \sigma_w^2 \boldsymbol{\phi}(\mathbf{x})^\top \boldsymbol{\phi}(\mathbf{x}')$$
This is an inner product in feature space! The kernel measures similarity between inputs via their feature representations.
For any finite collection of inputs $\{\mathbf{x}_1, \ldots, \mathbf{x}_n\}$, the joint distribution over function values is:
$$\begin{bmatrix} f(\mathbf{x}_1) \\ \vdots \\ f(\mathbf{x}_n) \end{bmatrix} \sim \mathcal{N}(\mathbf{0}, \mathbf{K})$$
where $K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)$.
This is exactly the definition of a Gaussian Process! The weight space prior induces a GP prior on functions.
We've just proved that Bayesian linear regression with feature mapping φ and Gaussian prior on weights is equivalent to a Gaussian Process with kernel k(x, x') = φ(x)ᵀ Sₚ φ(x'). This is the heart of the weight space view: GPs are Bayesian linear regression in disguise—specifically, with (potentially infinitely many) basis functions.
The weight-function duality reveals something remarkable: we never need to compute the features explicitly. All that matters is the kernel, which is the inner product in feature space.
The Kernel Trick: If we can compute $k(\mathbf{x}, \mathbf{x}') = \boldsymbol{\phi}(\mathbf{x})^\top \boldsymbol{\phi}(\mathbf{x}')$ directly without forming the (potentially infinite-dimensional) features, we can work with arbitrarily complex feature spaces at finite computational cost.
Example: Polynomial Kernel
Consider inputs $x, x' \in \mathbb{R}$. Define:
$$k(x, x') = (1 + xx')^2$$
Expanding: $$k(x, x') = 1 + 2xx' + x^2 x'^2 = \begin{bmatrix} 1 & \sqrt{2}x & x^2 \end{bmatrix} \begin{bmatrix} 1 \\ \sqrt{2}x' \\ x'^2 \end{bmatrix}$$
So this kernel corresponds to features $\boldsymbol{\phi}(x) = [1, \sqrt{2}x, x^2]^\top$—polynomial features up to degree 2.
For $(1 + xx')^d$, we get features up to degree $d$. The computation is still just a single evaluation of $(1 + xx')^d$, regardless of how many features this represents!
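A quick numerical check of this correspondence (a minimal sketch; `phi` and `poly_kernel` are illustrative helper names, not part of any library):

```python
import numpy as np

def phi(x):
    """Explicit degree-2 polynomial features: [1, sqrt(2)*x, x^2]."""
    return np.array([1.0, np.sqrt(2.0) * x, x**2])

def poly_kernel(x, x_prime):
    """Polynomial kernel (1 + x*x')^2 evaluated directly."""
    return (1.0 + x * x_prime)**2

x, x_prime = 0.7, -1.3
print(phi(x) @ phi(x_prime))    # inner product in feature space
print(poly_kernel(x, x_prime))  # single kernel evaluation -> same number
```

Both lines print the same value: the single kernel evaluation implicitly performs the feature-space inner product.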
Example: Gaussian (RBF) Kernel
The squared exponential (RBF) kernel is defined as:
$$k(x, x') = \exp\left(-\frac{(x - x')^2}{2\ell^2}\right)$$
This kernel corresponds to an infinite-dimensional feature space! Via Taylor expansion of the exponential:
$$\exp\left(-\frac{(x - x')^2}{2\ell^2}\right) = \exp\left(-\frac{x^2 + x'^2}{2\ell^2}\right) \exp\left(\frac{xx'}{\ell^2}\right)$$
$$= \exp\left(-\frac{x^2 + x'^2}{2\ell^2}\right) \sum_{n=0}^{\infty} \frac{(xx')^n}{\ell^{2n} n!}$$
The infinite sum reveals infinitely many polynomial features, each weighted by the exponential decay. The kernel computes this infinite inner product in closed form!
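To make the "infinite inner product in closed form" claim concrete, here is a small sketch that truncates the series after $N$ terms and watches it converge to the exact RBF value (the inputs and truncation points are arbitrary choices for illustration):

```python
import numpy as np
from math import factorial

def rbf_exact(x, x_prime, ell=1.0):
    """Exact squared-exponential kernel."""
    return np.exp(-(x - x_prime)**2 / (2 * ell**2))

def rbf_truncated(x, x_prime, ell=1.0, N=5):
    """Truncate the Taylor series of exp(x*x'/ell^2) after N terms."""
    prefactor = np.exp(-(x**2 + x_prime**2) / (2 * ell**2))
    series = sum((x * x_prime)**n / (ell**(2 * n) * factorial(n)) for n in range(N))
    return prefactor * series

x, x_prime = 0.8, 1.5
for N in [1, 2, 5, 10, 20]:
    print(f"N={N:2d}: {rbf_truncated(x, x_prime, N=N):.6f}  "
          f"(exact: {rbf_exact(x, x_prime):.6f})")
```

Each added term corresponds to one more polynomial feature; the exact kernel evaluation sums all of them at once.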
The kernel trick transforms what would require infinite-dimensional linear algebra into finite matrix operations. For n training points, we work with an n×n kernel matrix K, regardless of the (potentially infinite) dimensionality of the feature space. This is why GPs can represent arbitrarily complex functions while remaining computationally tractable.
| Kernel | Formula | Feature Space Dimension | Interpretation |
|---|---|---|---|
| Linear | $\mathbf{x}^\top \mathbf{x}'$ | $d$ (input dim) | Standard linear regression |
| Polynomial | $(c + \mathbf{x}^\top \mathbf{x}')^p$ | $\binom{d+p}{p}$ | Polynomials up to degree $p$ |
| RBF (Gaussian) | $\exp\left(-\frac{\lVert\mathbf{x}-\mathbf{x}'\rVert^2}{2\ell^2}\right)$ | $\infty$ | All polynomial degrees |
| Laplacian | $\exp\left(-\frac{\lVert\mathbf{x}-\mathbf{x}'\rVert}{\ell}\right)$ | $\infty$ | Less smooth than RBF |
| Matérn | Various forms | $\infty$ | Controlled smoothness |
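To see the "Interpretation" column in action, the sketch below draws one prior sample from a GP with each of three stationary kernels from the table and compares how rough the samples are (the Matérn entry uses the common $\nu = 3/2$ form as a concrete instance; grid, length scales, and the roughness proxy are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 5, 200)

def rbf(x1, x2, ell=1.0):
    return np.exp(-np.subtract.outer(x1, x2)**2 / (2 * ell**2))

def laplacian(x1, x2, ell=1.0):
    return np.exp(-np.abs(np.subtract.outer(x1, x2)) / ell)

def matern32(x1, x2, ell=1.0):
    d = np.abs(np.subtract.outer(x1, x2))
    return (1 + np.sqrt(3) * d / ell) * np.exp(-np.sqrt(3) * d / ell)

for name, k in [("RBF", rbf), ("Matern-3/2", matern32), ("Laplacian", laplacian)]:
    K = k(x, x) + 1e-6 * np.eye(len(x))              # jitter for numerical stability
    f = np.linalg.cholesky(K) @ rng.standard_normal(len(x))   # sample f ~ N(0, K)
    # Rough smoothness proxy: mean absolute increment between neighbouring grid points
    print(f"{name:12s} mean |f(x+dx) - f(x)|: {np.mean(np.abs(np.diff(f))):.3f}")
```

The RBF sample has the smallest increments (smoothest), the Laplacian the largest, matching the table's smoothness ordering.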
Let's derive GP predictions from the weight space perspective. This derivation shows how the kernel naturally emerges.
Setup: We have training data $(\mathbf{X}, \mathbf{y})$ and want to predict at test points $\mathbf{X}_*$.
Weight space posterior (from Section 1): $$\mathbf{w}|\mathcal{D} \sim \mathcal{N}(\boldsymbol{\mu}_w, \boldsymbol{\Sigma}_w)$$
Predictive distribution for test points: $$f_* = \boldsymbol{\Phi}_* \mathbf{w}$$
where $\boldsymbol{\Phi}_*$ is the feature matrix for test points.
Since $\mathbf{w}$ is Gaussian and $f_*$ is linear in $\mathbf{w}$:
$$f_*|\mathcal{D} \sim \mathcal{N}(\boldsymbol{\Phi}_* \boldsymbol{\mu}_w,\; \boldsymbol{\Phi}_* \boldsymbol{\Sigma}_w \boldsymbol{\Phi}_*^\top)$$
The Kernelized Form: Now we'll show this equals the standard GP prediction. Substituting the expressions for $\boldsymbol{\mu}_w$ and $\boldsymbol{\Sigma}_w$ and applying matrix identities (specifically, the Woodbury identity):
Predictive Mean: $$\mathbb{E}[f_*] = \mathbf{K}_{*f} (\mathbf{K}_{ff} + \sigma_n^2 \mathbf{I})^{-1} \mathbf{y}$$
Predictive Covariance: $$\text{Cov}[f_*] = \mathbf{K}_{**} - \mathbf{K}_{*f} (\mathbf{K}_{ff} + \sigma_n^2 \mathbf{I})^{-1} \mathbf{K}_{f*}$$
where:
- $\mathbf{K}_{ff}$ is the $n \times n$ matrix of kernel values between training inputs, $[\mathbf{K}_{ff}]_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)$
- $\mathbf{K}_{*f}$ is the matrix of kernel values between test and training inputs (and $\mathbf{K}_{f*} = \mathbf{K}_{*f}^\top$)
- $\mathbf{K}_{**}$ is the matrix of kernel values between test inputs
Crucially: These formulas only involve kernel evaluations, not explicit features! We've transformed from $D \times D$ operations (potentially infinite) to $n \times n$ operations (number of training points).
```python
import numpy as np
from scipy.linalg import solve
from scipy.special import comb

# Setup: Polynomial regression as both weight-space and kernel-space
np.random.seed(42)

# Training data
X_train = np.array([0.5, 1.0, 2.0, 3.0, 4.0])
y_train = np.array([1.2, 0.8, 1.5, 3.0, 2.8])
n = len(X_train)
noise_var = 0.1

# Test points
X_test = np.linspace(0, 5, 100)

# === WEIGHT SPACE VIEW ===
# Binomial-weighted polynomial features, chosen so that
# phi(x) . phi(x') = (1 + x*x')^degree matches the kernel used below
def polynomial_features(x, degree=3):
    """Features phi_i(x) = sqrt(C(degree, i)) * x^i for i = 0..degree."""
    return np.column_stack([np.sqrt(comb(degree, i)) * x**i
                            for i in range(degree + 1)])

degree = 3
Phi_train = polynomial_features(X_train, degree)   # n x (degree+1)
Phi_test = polynomial_features(X_test, degree)     # n_test x (degree+1)

# Prior covariance on weights (identity scaled)
prior_var = 1.0
S_p = prior_var * np.eye(degree + 1)
S_p_inv = np.eye(degree + 1) / prior_var

# Posterior on weights
A = S_p_inv + Phi_train.T @ Phi_train / noise_var
Sigma_w = np.linalg.inv(A)
mu_w = Sigma_w @ Phi_train.T @ y_train / noise_var

# Predictive distribution (weight space)
mean_ws = Phi_test @ mu_w
cov_ws = Phi_test @ Sigma_w @ Phi_test.T

print("=== WEIGHT SPACE VIEW ===")
print(f"Working in {degree + 1}-dimensional feature space")
print(f"Posterior weight mean: {mu_w.round(3)}")

# === KERNEL SPACE VIEW ===
# Polynomial kernel: k(x, x') = (1 + x*x')^degree
def polynomial_kernel(x1, x2, degree=3, c=1.0):
    """Polynomial kernel matrix."""
    return (c + np.outer(x1, x2))**degree

K_train = polynomial_kernel(X_train, X_train, degree) * prior_var
K_test_train = polynomial_kernel(X_test, X_train, degree) * prior_var
K_test = polynomial_kernel(X_test, X_test, degree) * prior_var

# GP prediction (kernel space)
L = np.linalg.cholesky(K_train + noise_var * np.eye(n))
alpha = solve(L.T, solve(L, y_train))

mean_ks = K_test_train @ alpha
v = solve(L, K_test_train.T)
cov_ks = K_test - v.T @ v

print("\n=== KERNEL SPACE VIEW ===")
print(f"Working with {n}x{n} kernel matrix (regardless of feature dim)")

# Verify equivalence
print("\n=== VERIFICATION ===")
print(f"Mean difference (max): {np.max(np.abs(mean_ws - mean_ks)):.2e}")
print(f"Covariance difference (max): {np.max(np.abs(cov_ws - cov_ks)):.2e}")
print("Both views give IDENTICAL predictions!")
```

The code demonstrates that weight space and kernel space computations give identical predictions—but kernel space requires only n×n matrix operations regardless of feature dimension. For infinite-dimensional kernels like RBF, weight space computation is impossible, but kernel space computation remains tractable!
Not every function of two arguments can serve as a valid kernel. For the weight space interpretation to hold, the kernel must correspond to some inner product in a feature space. This is where Mercer's theorem provides the answer.
Mercer's Theorem (informal): A symmetric function $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is a valid kernel if and only if, for any finite set of points $\{\mathbf{x}_1, \ldots, \mathbf{x}_n\}$, the kernel matrix $\mathbf{K}$ with entries $K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)$ is positive semi-definite.
Positive semi-definite means: For all vectors $\mathbf{c} \in \mathbb{R}^n$:
$$\mathbf{c}^\top \mathbf{K} \mathbf{c} = \sum_{i,j} c_i c_j k(\mathbf{x}_i, \mathbf{x}_j) \geq 0$$
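A direct way to test the condition numerically is to build $\mathbf{K}$ on a set of points and inspect its eigenvalues. The sketch below contrasts the RBF kernel with a function that is symmetric but not PSD (the negative squared distance, chosen purely as an illustrative counterexample):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, size=20)

def rbf(x1, x2, ell=1.0):
    return np.exp(-np.subtract.outer(x1, x2)**2 / (2 * ell**2))

def not_a_kernel(x1, x2):
    """Symmetric, but NOT positive semi-definite: negative squared distance."""
    return -np.subtract.outer(x1, x2)**2

for name, k in [("RBF", rbf), ("-(x - x')^2", not_a_kernel)]:
    eigvals = np.linalg.eigvalsh(k(x, x))
    verdict = "valid (PSD up to round-off)" if eigvals.min() > -1e-10 else "NOT a valid kernel"
    print(f"{name:12s} min eigenvalue: {eigvals.min():+.3f}  {verdict}")
```

The RBF kernel matrix has no meaningfully negative eigenvalues; the counterexample does, so it cannot arise as an inner product in any feature space.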
Why this matters: PSD kernels guarantee that:
- every finite matrix $\mathbf{K}$ built from the kernel is a valid Gaussian covariance (no negative variances), so the GP prior is well defined and can be sampled from;
- some feature map $\boldsymbol{\phi}$ exists with $k(\mathbf{x}, \mathbf{x}') = \boldsymbol{\phi}(\mathbf{x})^\top \boldsymbol{\phi}(\mathbf{x}')$, so the weight space interpretation always holds;
- the predictive formulas above remain well posed, since $\mathbf{K}_{ff} + \sigma_n^2 \mathbf{I}$ is positive definite and hence invertible.
Kernel algebra lets you construct sophisticated kernels from simple building blocks. For example, combining a periodic kernel with an RBF decay creates functions that are 'locally periodic but globally decaying.' The weight space view guarantees that any kernel built via these operations corresponds to some (possibly complex) feature space.
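As a small sketch of that example, a locally periodic kernel can be built as the product of a periodic kernel and an RBF envelope (the period and length scales below are arbitrary illustrative values):

```python
import numpy as np

def rbf(x1, x2, ell=2.0):
    return np.exp(-np.subtract.outer(x1, x2)**2 / (2 * ell**2))

def periodic(x1, x2, period=1.0, ell=0.5):
    d = np.subtract.outer(x1, x2)
    return np.exp(-2 * np.sin(np.pi * d / period)**2 / ell**2)

def locally_periodic(x1, x2):
    """Product of two valid kernels is a valid kernel: periodic pattern in an RBF envelope."""
    return periodic(x1, x2) * rbf(x1, x2)

x = np.linspace(0, 6, 150)
K = locally_periodic(x, x)
print("min eigenvalue:", np.linalg.eigvalsh(K).min())   # ~>= 0: still PSD, hence still a kernel

# Correlation one period apart stays high, but decays after many periods
print("k(0, 1) =", locally_periodic(np.array([0.0]), np.array([1.0]))[0, 0])
print("k(0, 5) =", locally_periodic(np.array([0.0]), np.array([5.0]))[0, 0])
```

The eigenvalue check confirms the product is still PSD, and the two kernel evaluations show the "locally periodic, globally decaying" behaviour described above.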
The RKHS Connection: Every valid kernel defines a Reproducing Kernel Hilbert Space (RKHS)—a special function space where:
- the functions $k(\cdot, \mathbf{x})$ are the basic building blocks, and the space consists of (limits of) their linear combinations;
- evaluation is an inner product: $f(\mathbf{x}) = \langle f, k(\cdot, \mathbf{x}) \rangle$ (the reproducing property);
- the RKHS norm measures how complex a function is as seen by the kernel.
This connects GPs to regularization theory: the GP posterior mean is the function in the RKHS that minimizes a regularized loss, with the kernel controlling the regularization!
The weight space view becomes fully explicit through the spectral decomposition of the kernel. This reveals the exact basis functions and their associated 'weights.'
Mercer's Eigenvalue Theorem: For a positive semi-definite kernel $k$ on a compact domain, there exists a sequence of eigenvalues $\lambda_1 \geq \lambda_2 \geq \cdots \geq 0$ and orthonormal eigenfunctions $\{\phi_i\}$ such that:
$$k(\mathbf{x}, \mathbf{x}') = \sum_{i=1}^{\infty} \lambda_i \phi_i(\mathbf{x}) \phi_i(\mathbf{x}')$$
The GP as Weighted Basis Expansion: This decomposition shows that a GP with kernel $k$ can be written as:
$$f(\mathbf{x}) = \sum_{i=1}^{\infty} \sqrt{\lambda_i} w_i \phi_i(\mathbf{x}), \quad w_i \sim \mathcal{N}(0, 1) \text{ i.i.d.}$$
where $\{\phi_i\}$ are the eigenfunctions and the eigenvalues $\{\lambda_i\}$ weight their contributions.
Intuition from Eigenvalue Decay:
For the RBF kernel, eigenvalues decay exponentially, meaning:
- only a small number of smooth, slowly varying eigenfunctions carry almost all of the prior variance;
- sample functions are extremely smooth (infinitely differentiable);
- the effective number of degrees of freedom is far smaller than "infinite-dimensional" might suggest.
For polynomial kernels, only finitely many eigenvalues are non-zero—corresponding to the finite-dimensional polynomial feature space.
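This contrast can be seen empirically by eigendecomposing kernel matrices on a dense grid, which serves as a finite-sample proxy for Mercer's eigenvalues (grid, hyperparameters, and the significance threshold below are illustrative choices):

```python
import numpy as np

x = np.linspace(0, 1, 200)

K_rbf = np.exp(-np.subtract.outer(x, x)**2 / (2 * 0.2**2))   # RBF, length scale 0.2
K_poly = (1 + np.outer(x, x))**3                              # polynomial kernel, degree 3

for name, K in [("RBF", K_rbf), ("Polynomial (deg 3)", K_poly)]:
    eigvals = np.sort(np.linalg.eigvalsh(K))[::-1]
    # Count eigenvalues that are non-negligible relative to the largest one
    significant = np.sum(eigvals > 1e-10 * eigvals[0])
    print(f"{name:20s} top 5 eigenvalues: {np.round(eigvals[:5], 2)}, "
          f"significant: {significant} of {len(x)}")
```

The degree-3 polynomial kernel matrix has exactly four significant eigenvalues (its four-dimensional feature space), while the RBF matrix has many more, with magnitudes that fall off rapidly.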
| Kernel | Eigenvalue Decay | Effective Dimensionality | Sample Function Properties |
|---|---|---|---|
| Linear | Finitely many non-zero | Input dimension $d$ | Linear functions |
| Polynomial (degree $p$) | Finitely many non-zero | $\binom{d+p}{p}$ | Polynomials up to degree $p$ |
| RBF (Gaussian) | Exponential: $\lambda_i \sim e^{-ci}$ | Infinite (rapidly decaying) | Infinitely smooth |
| Matérn-ν | Polynomial: $\lambda_i \sim i^{-(2\nu+d)/d}$ | Infinite (slowly decaying) | $\nu$-times differentiable |
| Periodic | Depends on smoothness | Infinite | Periodic with specified period |
There's a deep connection: smoother kernels have faster eigenvalue decay. The RBF kernel's infinitely differentiable sample functions correspond to exponential eigenvalue decay. Matérn kernels with finite smoothness have polynomial decay. This connection—via Sobolev spaces—links GP theory to classical functional analysis.
The weight space view enables a powerful approximation technique: Random Fourier Features (RFF). This method provides finite-dimensional approximations to kernels, enabling scalable GP inference.
Bochner's Theorem: A shift-invariant kernel $k(\mathbf{x}, \mathbf{x}') = k(\mathbf{x} - \mathbf{x}')$ is positive semi-definite if and only if it is the Fourier transform of a non-negative measure:
$$k(\mathbf{x} - \mathbf{x}') = \int p(\boldsymbol{\omega}) e^{i \boldsymbol{\omega}^\top (\mathbf{x} - \mathbf{x}')} d\boldsymbol{\omega}$$
where $p(\boldsymbol{\omega})$ is a probability distribution called the spectral density.
For the RBF kernel with length scale $\ell$:
$$p(\boldsymbol{\omega}) = \mathcal{N}(\boldsymbol{\omega} | \mathbf{0}, \ell^{-2} \mathbf{I})$$
The Random Fourier Features Approximation:
1. Sample $m$ frequencies $\boldsymbol{\omega}_1, \ldots, \boldsymbol{\omega}_m \sim p(\boldsymbol{\omega})$ and offsets $b_1, \ldots, b_m \sim \text{Uniform}[0, 2\pi]$.
2. Define the random features $\phi_i(\mathbf{x}) = \sqrt{2/m}\, \cos(\boldsymbol{\omega}_i^\top \mathbf{x} + b_i)$.
3. Approximate the kernel by the finite inner product $k(\mathbf{x}, \mathbf{x}') \approx \boldsymbol{\phi}(\mathbf{x})^\top \boldsymbol{\phi}(\mathbf{x}')$, which converges to the exact kernel as $m \to \infty$.
Why this is powerful: With $m$ random features, we've reduced an $n \times n$ kernel matrix computation to an $n \times m$ feature matrix. For $m \ll n$, this dramatically speeds up GP training from $O(n^3)$ to $O(nm^2)$.
```python
import numpy as np
import matplotlib.pyplot as plt

# Random Fourier Features approximation to RBF kernel
np.random.seed(42)

def rbf_kernel(x1, x2, length_scale=1.0):
    """Exact RBF kernel"""
    return np.exp(-0.5 * np.subtract.outer(x1, x2)**2 / length_scale**2)

def random_fourier_features(x, omega, b):
    """Compute RFF features: sqrt(2/m) * cos(omega.T @ x + b)"""
    m = len(b)
    # x: (n,), omega: (m,), b: (m,)
    return np.sqrt(2.0 / m) * np.cos(np.outer(x, omega) + b)

# Setup
length_scale = 1.0
x = np.linspace(0, 5, 50)

# Exact kernel matrix
K_exact = rbf_kernel(x, x, length_scale)

# Compare RFF approximations with different m
fig, axes = plt.subplots(2, 3, figsize=(14, 8))

for idx, m in enumerate([5, 20, 100, 500, 2000, 10000]):
    ax = axes[idx // 3, idx % 3]

    # Sample from spectral density: N(0, 1/ℓ²)
    omega = np.random.randn(m) / length_scale
    b = np.random.uniform(0, 2 * np.pi, m)

    # Compute RFF features and approximate kernel
    Phi = random_fourier_features(x, omega, b)
    K_approx = Phi @ Phi.T

    # Compute error
    error = np.max(np.abs(K_exact - K_approx))

    # Plot
    im = ax.imshow(K_approx, cmap='viridis')
    ax.set_title(f'm = {m}\nMax Error: {error:.4f}')
    ax.set_xlabel('x index')
    ax.set_ylabel('x index')

plt.suptitle('Random Fourier Features Approximation to RBF Kernel', fontsize=14)
plt.tight_layout()
plt.show()

print("As m increases, RFF approximation converges to the exact kernel.")
print("This enables weight-space GP computation for infinite-dimensional kernels!")
```

Random Fourier Features bridge the conceptual elegance of the weight space view with scalable computation. With m features, you can: (1) train GP-like models with O(nm²) complexity instead of O(n³), (2) use standard linear regression solvers, (3) apply stochastic gradient methods for massive datasets. This has enabled GP-scale uncertainty quantification in deep learning and large-scale optimization.
We've now developed both perspectives on Gaussian Processes. Let's consolidate their relationship and understand when each view is most useful.
The Mathematical Equivalence:
| Function Space View | Weight Space View |
|---|---|
| Distribution over functions $p(f)$ | Distribution over weights $p(\mathbf{w})$ inducing functions |
| Specified by mean $m(\mathbf{x})$ and kernel $k(\mathbf{x}, \mathbf{x}')$ | Specified by prior $p(\mathbf{w})$ and features $\boldsymbol{\phi}(\mathbf{x})$ |
| Kernel $k(\mathbf{x}, \mathbf{x}')$ | Inner product $\boldsymbol{\phi}(\mathbf{x})^\top \mathbf{S}_p \boldsymbol{\phi}(\mathbf{x}')$ |
| Works with infinite-dimensional function spaces | May require infinitely many basis functions |
| Computation: $O(n^3)$ via kernel matrices | Computation: $O(D^3)$ via weight posteriors |
Key Insight: The views are equivalent when $k(\mathbf{x}, \mathbf{x}') = \boldsymbol{\phi}(\mathbf{x})^\top \mathbf{S}_p \boldsymbol{\phi}(\mathbf{x}')$. Every kernel corresponds to some feature space (and vice versa), but the kernel representation is often more compact.
Expert GP practitioners fluently switch between views. Use function space for model building and interpretation—it's easier to reason about what kinds of functions your kernel implies. Use weight space for computation and approximation—especially when scaling to large datasets. The duality is a feature, not a bug.
A beautiful connection emerges from the weight space view: neural networks become Gaussian Processes in the infinite-width limit. This theoretical result bridges deep learning and kernel methods.
Neal's Observation (1996): Consider a single hidden-layer neural network:
$$f(\mathbf{x}) = \sum_{j=1}^{H} v_j \sigma(\mathbf{w}_j^\top \mathbf{x})$$
where $\mathbf{w}_j, v_j$ are weights initialized i.i.d. from distributions with zero mean, and $\sigma$ is a nonlinear activation.
As $H \to \infty$: By the Central Limit Theorem, the sum of many independent random terms converges to a Gaussian. Thus:
$$f(\mathbf{x}) \xrightarrow{d} \mathcal{GP}(0, k_{\text{NN}})$$
where $k_{\text{NN}}$ is a kernel determined by the activation function and weight initialization.
The Neural Network Kernel: For ReLU activations with standard initialization:
$$k_{\text{NN}}(\mathbf{x}, \mathbf{x}') = \frac{\lVert\mathbf{x}\rVert\, \lVert\mathbf{x}'\rVert}{2\pi} \left(\sin \theta + (\pi - \theta) \cos \theta\right)$$
where $\theta = \cos^{-1}\left(\frac{\mathbf{x}^\top \mathbf{x}'}{\lVert\mathbf{x}\rVert\, \lVert\mathbf{x}'\rVert}\right)$ is the angle between inputs.
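A Monte Carlo sketch of Neal's argument for this ReLU case: average over many random hidden units and compare with the closed-form kernel above (this assumes hidden weights $\mathbf{w}_j \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ and output weights with variance $1/H$, a standard scaling adopted here for illustration):

```python
import numpy as np

def relu_nn_kernel(x, x_prime):
    """Closed-form kernel above, for one ReLU layer with N(0, I) input weights."""
    nx, nxp = np.linalg.norm(x), np.linalg.norm(x_prime)
    cos_theta = np.clip(x @ x_prime / (nx * nxp), -1.0, 1.0)
    theta = np.arccos(cos_theta)
    return nx * nxp / (2 * np.pi) * (np.sin(theta) + (np.pi - theta) * np.cos(theta))

rng = np.random.default_rng(0)
x = np.array([1.0, 0.5])
x_prime = np.array([-0.3, 2.0])

H = 200_000                          # hidden width, large to approximate the limit
W = rng.standard_normal((H, 2))      # hidden weights w_j ~ N(0, I)

# With output weights of variance 1/H, Cov[f(x), f(x')] = mean_j relu(w_j.x) * relu(w_j.x')
h_x = np.maximum(W @ x, 0.0)
h_xp = np.maximum(W @ x_prime, 0.0)
empirical = np.mean(h_x * h_xp)

print("closed-form kernel :", relu_nn_kernel(x, x_prime))
print("Monte Carlo average:", empirical)   # the two numbers should agree closely for large H
```

As the width grows, the empirical covariance of the random network converges to the closed-form neural network kernel, which is exactly the GP limit described above.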
For Deep Networks (Jacot et al., 2018): The connection extends to deep networks via the Neural Tangent Kernel (NTK). Infinite-width deep networks trained by gradient descent behave like kernel regression with a specific kernel determined by the architecture.
Implications:
- Exact Bayesian inference for an infinite-width network reduces to GP regression with $k_{\text{NN}}$—no backpropagation required.
- Kernel methods and neural networks are not opposing paradigms but two ends of the same family of models.
- Finite-width networks can depart from the GP limit by learning features, one way to understand their additional expressive power.
This connection isn't just theoretical curiosity. It explains why neural networks generalize well, provides initialization schemes (to start near the GP regime), and suggests that somewhere between 'exactly GP' and 'finite-width NN' lies a tradeoff between analytical tractability and expressive power. Understanding both paradigms enriches your intuition for both.
We've now developed the complete weight space view of Gaussian Processes. Here are the essential takeaways:
- Bayesian linear regression with feature map $\boldsymbol{\phi}$ and a Gaussian prior on the weights is exactly a GP with kernel $k(\mathbf{x}, \mathbf{x}') = \boldsymbol{\phi}(\mathbf{x})^\top \mathbf{S}_p \boldsymbol{\phi}(\mathbf{x}')$.
- The kernel trick replaces $D \times D$ weight-space computation with $n \times n$ kernel-space computation, even when $D$ is infinite.
- Valid kernels are exactly the symmetric positive semi-definite functions; each defines an implicit feature space and an RKHS.
- Mercer's eigendecomposition exposes the basis functions behind a kernel; the rate of eigenvalue decay encodes smoothness and effective dimensionality.
- Random Fourier Features recover a finite-dimensional weight-space approximation, reducing training cost from $O(n^3)$ to $O(nm^2)$.
- Infinite-width neural networks converge to GPs, connecting kernel methods and deep learning.
With both function space and weight space views established, we're ready to formally define Gaussian Processes and their key mathematical properties. The next page presents the rigorous GP definition, bringing together the intuitions we've developed into a complete mathematical framework.
You now understand the weight space perspective on Gaussian Processes—showing how GPs emerge from Bayesian linear regression with (potentially infinite) basis functions. This view provides computational intuition, connects to neural networks, and enables scalable approximations. Together with the function space view, you have complete dual perspectives on GP fundamentals.