Having developed intuition from both function space and weight space perspectives, we now state the formal definition of a Gaussian Process. This definition is mathematically precise yet beautifully simple—it captures everything essential about GPs in a single statement.
Understanding this definition deeply is crucial: every GP computation, every algorithm, every application ultimately derives from this foundational statement.
A Gaussian Process is a collection of random variables, any finite number of which have a joint Gaussian distribution.
— Rasmussen & Williams, Gaussian Processes for Machine Learning (2006)
This definition is deceptively compact. Let's unpack each component:
'Collection of random variables': The random variables are the function values $\{f(\mathbf{x}) : \mathbf{x} \in \mathcal{X}\}$ for all inputs in some domain $\mathcal{X}$. There are potentially uncountably many such variables (one for every possible input).
'Any finite number': We can pick any finite subset of inputs $\{\mathbf{x}_1, \ldots, \mathbf{x}_n\}$ and consider the corresponding function values $\{f(\mathbf{x}_1), \ldots, f(\mathbf{x}_n)\}$.
'Joint Gaussian distribution': The vector $[f(\mathbf{x}_1), \ldots, f(\mathbf{x}_n)]^\top$ follows a multivariate Gaussian distribution for any choice of inputs and any $n$.
By the end of this page, you will completely understand the GP definition, its mathematical implications, the notation conventions, and how this definition enables tractable computation over infinite-dimensional function spaces. You'll master the specification of GPs through their mean and covariance functions.
A GP is completely specified by two functions: a mean function and a covariance function (also called the kernel).
Mean Function $m: \mathcal{X} \to \mathbb{R}$: $$m(\mathbf{x}) = \mathbb{E}[f(\mathbf{x})]$$
This gives the expected value of the function at any input. It represents our prior belief about the 'central' function before seeing data.
Covariance Function (Kernel) $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$: $$k(\mathbf{x}, \mathbf{x}') = \text{Cov}[f(\mathbf{x}), f(\mathbf{x}')] = \mathbb{E}[(f(\mathbf{x}) - m(\mathbf{x}))(f(\mathbf{x}') - m(\mathbf{x}'))]$$
This gives the covariance between function values at any two inputs. It encodes our prior beliefs about function properties: smoothness, periodicity, variation scale.
Standard Notation: We write: $$f \sim \mathcal{GP}(m, k)$$
to denote that $f$ is distributed as a Gaussian Process with mean function $m$ and covariance function $k$.
To fully define a GP model, you need only specify m(x) and k(x, x'). Everything else—priors, posteriors, predictions, uncertainty—follows from manipulating multivariate Gaussians. The art of GP modeling is choosing m and k to reflect your prior beliefs about the function you're trying to learn.
Given a GP $f \sim \mathcal{GP}(m, k)$ and any finite set of inputs $\mathbf{X} = \{\mathbf{x}_1, \ldots, \mathbf{x}_n\}$, the GP definition tells us:
$$\mathbf{f} = \begin{bmatrix} f(\mathbf{x}_1) \\ f(\mathbf{x}_2) \\ \vdots \\ f(\mathbf{x}_n) \end{bmatrix} \sim \mathcal{N}(\boldsymbol{\mu}, \mathbf{K})$$
where:
Mean vector: $$\boldsymbol{\mu} = \begin{bmatrix} m(\mathbf{x}_1) \\ m(\mathbf{x}_2) \\ \vdots \\ m(\mathbf{x}_n) \end{bmatrix}$$
Covariance matrix (Gram matrix): $$\mathbf{K} = \begin{bmatrix} k(\mathbf{x}_1, \mathbf{x}_1) & k(\mathbf{x}_1, \mathbf{x}_2) & \cdots & k(\mathbf{x}_1, \mathbf{x}_n) \\ k(\mathbf{x}_2, \mathbf{x}_1) & k(\mathbf{x}_2, \mathbf{x}_2) & \cdots & k(\mathbf{x}_2, \mathbf{x}_n) \\ \vdots & \vdots & \ddots & \vdots \\ k(\mathbf{x}_n, \mathbf{x}_1) & k(\mathbf{x}_n, \mathbf{x}_2) & \cdots & k(\mathbf{x}_n, \mathbf{x}_n) \end{bmatrix}$$
Properties of the Gram Matrix:
Symmetric: $k(\mathbf{x}, \mathbf{x}') = k(\mathbf{x}', \mathbf{x})$, so $\mathbf{K} = \mathbf{K}^\top$
Positive semi-definite: For any vector $\mathbf{a}$, $\mathbf{a}^\top \mathbf{K} \mathbf{a} \geq 0$. This is required for $\mathbf{K}$ to be a valid covariance matrix.
Diagonal gives variances: $K_{ii} = k(\mathbf{x}_i, \mathbf{x}_i) = \text{Var}[f(\mathbf{x}_i)]$
Off-diagonal gives covariances: $K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)$ measures how strongly function values at different inputs covary
Compact Notation: $$\mathbf{K} = k(\mathbf{X}, \mathbf{X})$$
where $k(\mathbf{X}, \mathbf{X}')$ denotes the matrix with entries $[k(\mathbf{X}, \mathbf{X}')]_{ij} = k(\mathbf{x}_i, \mathbf{x}_j')$.
```python
import numpy as np
from scipy.linalg import cholesky
import matplotlib.pyplot as plt

# Define a Gaussian Process
class GaussianProcess:
    """
    A Gaussian Process specified by mean and covariance functions.

    f ~ GP(m, k)

    For any finite set of inputs X, f(X) ~ N(m(X), k(X, X))
    """

    def __init__(self, mean_fn, kernel_fn):
        """
        Args:
            mean_fn: m(x) -> scalar mean at input x
            kernel_fn: k(x1, x2) -> covariance between inputs
        """
        self.mean_fn = mean_fn
        self.kernel_fn = kernel_fn

    def mean_vector(self, X):
        """Compute mean vector μ = [m(x₁), ..., m(xₙ)]ᵀ"""
        return np.array([self.mean_fn(x) for x in X])

    def gram_matrix(self, X1, X2=None):
        """
        Compute Gram matrix K with entries K[i,j] = k(X1[i], X2[j])

        If X2 is None, computes k(X1, X1)
        """
        if X2 is None:
            X2 = X1
        n1, n2 = len(X1), len(X2)
        K = np.zeros((n1, n2))
        for i in range(n1):
            for j in range(n2):
                K[i, j] = self.kernel_fn(X1[i], X2[j])
        return K

    def sample(self, X, n_samples=1):
        """
        Sample functions from GP at inputs X.

        Returns:
            (n_samples, len(X)) array of function values
        """
        mu = self.mean_vector(X)
        K = self.gram_matrix(X) + 1e-8 * np.eye(len(X))  # jitter for stability
        L = cholesky(K, lower=True)
        samples = []
        for _ in range(n_samples):
            z = np.random.randn(len(X))
            f = mu + L @ z  # f ~ N(μ, K)
            samples.append(f)
        return np.array(samples)

# Example: GP with zero mean and RBF kernel
def zero_mean(x):
    return 0.0

def rbf_kernel(x1, x2, length_scale=1.0, variance=1.0):
    return variance * np.exp(-0.5 * (x1 - x2)**2 / length_scale**2)

# Create GP
gp = GaussianProcess(
    mean_fn=zero_mean,
    kernel_fn=lambda x1, x2: rbf_kernel(x1, x2, length_scale=1.0, variance=1.0)
)

# Sample at finite points
X = np.linspace(0, 5, 100)
samples = gp.sample(X, n_samples=5)

# Visualize
plt.figure(figsize=(12, 5))
for i, sample in enumerate(samples):
    plt.plot(X, sample, label=f'Sample {i+1}')
plt.xlabel('x')
plt.ylabel('f(x)')
plt.title('GP Definition: Samples from f ~ GP(0, k_RBF)')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

print("Each sample is drawn from the n-dimensional Gaussian N(μ, K)")
print(f"Here n = {len(X)}, so we're sampling from a {len(X)}-dim Gaussian.")
```

The GP definition implies several fundamental properties that make GPs analytically tractable and practically powerful.
Property 1: Marginalization
If $f \sim \mathcal{GP}(m, k)$, then for any subset of inputs $\mathbf{X}_A \subset \mathbf{X}$:
$$f(\mathbf{X}_A) \sim \mathcal{N}(m(\mathbf{X}_A), k(\mathbf{X}_A, \mathbf{X}_A))$$
This follows from the marginalization property of multivariate Gaussians: dropping variables just means dropping the corresponding rows/columns from mean vector and covariance matrix.
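This "drop rows and columns" operation can be checked directly. The sketch below (a minimal example; the RBF helper and variable names are mine, not from the text) confirms that the Gram matrix of a subset equals the corresponding submatrix of the full Gram matrix:

```python
import numpy as np

def rbf(x1, x2, ell=1.0):
    """RBF kernel between two scalar inputs."""
    return np.exp(-0.5 * (x1 - x2)**2 / ell**2)

# Full set of inputs and its Gram matrix
X = np.array([0.0, 0.5, 1.0, 2.0, 3.0])
K = np.array([[rbf(a, b) for b in X] for a in X])

# Marginalizing onto the subset at indices {0, 2, 3} just selects
# the corresponding rows and columns of K
idx = [0, 2, 3]
K_A = K[np.ix_(idx, idx)]

# Same result as building the Gram matrix directly on the subset
K_direct = np.array([[rbf(a, b) for b in X[idx]] for a in X[idx]])
print(np.allclose(K_A, K_direct))  # True
```

No recomputation or approximation is needed: the marginal distribution over the subset is exactly $\mathcal{N}(\boldsymbol{\mu}_A, \mathbf{K}_A)$.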
Property 2: Conditioning
Given observations $\mathbf{f}_A = f(\mathbf{X}_A)$, the conditional distribution of $\mathbf{f}_B = f(\mathbf{X}_B)$ is:
$$\mathbf{f}_B \mid \mathbf{f}_A \sim \mathcal{N}(\boldsymbol{\mu}_{B|A}, \mathbf{K}_{B|A})$$
where: $$\boldsymbol{\mu}_{B|A} = m(\mathbf{X}_B) + \mathbf{K}_{BA} \mathbf{K}_{AA}^{-1} (\mathbf{f}_A - m(\mathbf{X}_A))$$ $$\mathbf{K}_{B|A} = \mathbf{K}_{BB} - \mathbf{K}_{BA} \mathbf{K}_{AA}^{-1} \mathbf{K}_{AB}$$
This is the GP posterior—the updated distribution after observing data.
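The conditioning formulas translate directly into a few lines of linear algebra. A minimal sketch (the RBF kernel choice, inputs, and names are illustrative assumptions) conditions a zero-mean GP on three observed values:

```python
import numpy as np

def rbf(xa, xb, ell=1.0):
    """RBF Gram matrix between two 1-D input arrays."""
    return np.exp(-0.5 * np.subtract.outer(xa, xb)**2 / ell**2)

# Observed inputs/values (set A) and query inputs (set B); zero prior mean
X_A = np.array([0.0, 1.0, 2.0])
f_A = np.array([0.5, -0.3, 0.8])
X_B = np.array([0.5, 1.5])

K_AA = rbf(X_A, X_A) + 1e-8 * np.eye(len(X_A))  # jitter for stability
K_BA = rbf(X_B, X_A)
K_BB = rbf(X_B, X_B)

# Conditioning formulas with m = 0
mu_cond = K_BA @ np.linalg.solve(K_AA, f_A)
K_cond = K_BB - K_BA @ np.linalg.solve(K_AA, K_BA.T)

# Observing f_A can only shrink our uncertainty about f_B
print(np.all(np.diag(K_cond) <= np.diag(K_BB) + 1e-10))  # True
```

Note that the posterior covariance $\mathbf{K}_{B|A}$ never exceeds the prior covariance $\mathbf{K}_{BB}$ on the diagonal: conditioning reduces uncertainty.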
Property 3: Linear Operations Preserve Gaussianity
If $f \sim \mathcal{GP}(m, k)$ and $g(\mathbf{x}) = \int w(\mathbf{x}, \mathbf{z}) f(\mathbf{z}) d\mathbf{z}$ is a linear functional of $f$, then $g$ is also a GP:
$$g \sim \mathcal{GP}(m_g, k_g)$$
with: $$m_g(\mathbf{x}) = \int w(\mathbf{x}, \mathbf{z}) m(\mathbf{z}) d\mathbf{z}$$ $$k_g(\mathbf{x}, \mathbf{x}') = \iint w(\mathbf{x}, \mathbf{z}) k(\mathbf{z}, \mathbf{z}') w(\mathbf{x}', \mathbf{z}') d\mathbf{z} d\mathbf{z}'$$
Important special case: Derivatives
If $f \sim \mathcal{GP}(m, k)$ has sufficiently smooth sample paths, then:
$$\frac{\partial f}{\partial x_i} \sim \mathcal{GP}\left(\frac{\partial m}{\partial x_i}, \frac{\partial^2 k}{\partial x_i \partial x_i'}\right)$$
Derivatives of GPs are also GPs! This enables modeling of physical systems with gradient observations.
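The derivative-kernel formula above can be sanity-checked numerically. The sketch below (my own check, not from the text) approximates $\partial^2 k / \partial x \, \partial x'$ of the RBF kernel at $x = x'$ by central finite differences and compares it with the analytic value $1/\ell^2$ quoted in the next subsection:

```python
import numpy as np

def k_rbf(x, xp, ell=0.7):
    """Squared exponential kernel k(x, x')."""
    return np.exp(-0.5 * (x - xp)**2 / ell**2)

ell = 0.7
h = 1e-3

# Mixed partial d^2 k / (dx dx') at x = x' = 0 via central finite differences
fd = (k_rbf(h, h, ell) - k_rbf(h, -h, ell)
      - k_rbf(-h, h, ell) + k_rbf(-h, -h, ell)) / (4 * h**2)

print(fd, 1 / ell**2)  # both approximately 2.0408
```

The finite-difference estimate agrees with $1/\ell^2$ to several decimal places, confirming that the derivative process has finite variance.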
| Property | Statement | Use Case |
|---|---|---|
| Marginalization | Subset distributions are Gaussian | Ignore unobserved locations freely |
| Conditioning | Conditional distributions are Gaussian | Posterior inference given observations |
| Linear Functionals | Linear operations yield GPs | Integrals, derivatives, convolutions |
| Affine Transformation | $af + b$ is GP if $f$ is GP | Scaling and shifting predictions |
| Sum of Independent | $f + g$ is GP if $f, g$ independent GPs | Combining signal and noise models |
These properties are why GPs are computationally tractable. Unlike arbitrary distributions over functions, Gaussians have closed-form marginals, conditionals, and linear transformations. Every GP calculation reduces to multivariate Gaussian algebra—matrix multiplications and inversions—which we know how to compute efficiently.
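The sum-of-independent-GPs row in the table can be verified by Monte Carlo. A minimal sketch (kernel choices, scales, and sample count are my own illustrative assumptions): draw independent samples of $f \sim \mathcal{GP}(0, k_1)$ and $g \sim \mathcal{GP}(0, k_2)$ at a few points and check that the empirical covariance of $f + g$ matches $\mathbf{K}_1 + \mathbf{K}_2$:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.array([0.0, 0.5, 1.0, 1.5])

def rbf(xa, ell):
    """RBF Gram matrix of a 1-D input array with itself."""
    return np.exp(-0.5 * np.subtract.outer(xa, xa)**2 / ell**2)

K1 = rbf(x, ell=1.0)        # long-scale "signal" kernel
K2 = 0.3 * rbf(x, ell=0.2)  # short-scale "noise-like" kernel

L1 = np.linalg.cholesky(K1 + 1e-10 * np.eye(4))
L2 = np.linalg.cholesky(K2 + 1e-10 * np.eye(4))

# Independent samples of f ~ GP(0, k1) and g ~ GP(0, k2) at x
n = 50_000
f = L1 @ rng.standard_normal((4, n))
g = L2 @ rng.standard_normal((4, n))

# Empirical covariance of f + g should match K1 + K2
emp = np.cov(f + g)
print(np.max(np.abs(emp - (K1 + K2))))  # small (Monte Carlo error)
```

This is exactly the decomposition used when combining a smooth signal model with a rough noise model: the sum is again a GP whose kernel is the sum of the two kernels.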
Many commonly used kernels have special structure that simplifies analysis and interpretation.
Stationary Kernels:
A kernel is stationary if it depends only on the difference between inputs:
$$k(\mathbf{x}, \mathbf{x}') = k(\mathbf{x} - \mathbf{x}') = k(\boldsymbol{\tau})$$
where $\boldsymbol{\tau} = \mathbf{x} - \mathbf{x}'$ is the displacement vector.
Physical interpretation: The statistical properties of the function (variance, correlations) don't change as we 'shift' through input space. The function looks statistically similar at $x = 0$, $x = 100$, or anywhere else.
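Shift invariance is easy to test numerically. A small sketch (helper names and values are mine): shifting both inputs by the same amount leaves a stationary kernel unchanged, while a non-stationary kernel such as the linear kernel changes:

```python
import numpy as np

def rbf(x, xp, ell=1.0):
    """Stationary: depends only on x - x'."""
    return np.exp(-0.5 * (x - xp)**2 / ell**2)

def linear(x, xp, sb2=1.0, sv2=1.0):
    """Non-stationary: depends on the inputs themselves."""
    return sb2 + sv2 * x * xp

x, xp, shift = 0.3, 1.7, 100.0

# Stationary kernel: shifting both inputs leaves the covariance unchanged
print(np.isclose(rbf(x + shift, xp + shift), rbf(x, xp)))        # True
# Linear kernel: the same shift changes the covariance dramatically
print(np.isclose(linear(x + shift, xp + shift), linear(x, xp)))  # False
```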
Isotropic (Radial) Kernels:
A kernel is isotropic if it depends only on the distance between inputs:
$$k(\mathbf{x}, \mathbf{x}') = k(\|\mathbf{x} - \mathbf{x}'\|) = k(r)$$
where $r = \|\mathbf{x} - \mathbf{x}'\|$ is the Euclidean distance.
Physical interpretation: Correlations depend only on how far apart points are, not their direction. The function has no preferred orientation.
Non-Stationary Kernels:
Sometimes stationarity is inappropriate—the function may behave differently in different regions. Examples:
Linear Kernel: $k(\mathbf{x}, \mathbf{x}') = \sigma_b^2 + \sigma_v^2 \mathbf{x}^\top \mathbf{x}'$
Neural Network Kernel: Derived from infinite-width neural networks
Input-Dependent Length Scales (Gibbs kernel): $k(x, x') = \sqrt{\dfrac{2\,\ell(x)\,\ell(x')}{\ell(x)^2 + \ell(x')^2}} \exp\left(-\dfrac{(x - x')^2}{\ell(x)^2 + \ell(x')^2}\right)$, where the length scale $\ell(x)$ varies across the input space
Use stationary kernels as a default—they're simpler, have fewer parameters, and work well for many problems. Switch to non-stationary kernels when you have evidence that function properties genuinely vary across the input domain: trend lines that grow over time, signals that become more variable in certain regions, or physical systems with spatially varying characteristics.
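To see non-stationarity concretely, consider the linear kernel from the list above. A short sketch (parameter values are illustrative): the prior variance $\mathrm{Var}[f(x)] = \sigma_b^2 + \sigma_v^2 x^2$ grows with $|x|$, yet the Gram matrix remains positive semi-definite, so it is still a valid kernel:

```python
import numpy as np

def linear_kernel(x1, x2, sigma_b2=1.0, sigma_v2=0.5):
    """k(x, x') = sigma_b^2 + sigma_v^2 * x * x' (non-stationary)."""
    return sigma_b2 + sigma_v2 * np.outer(x1, x2)

x = np.linspace(-3, 3, 7)
K = linear_kernel(x, x)

# Prior variance sigma_b^2 + sigma_v^2 x^2 grows with |x| --
# the hallmark of non-stationarity
print(np.diag(K))  # largest at x = +/-3, smallest (= sigma_b^2) at x = 0

# Still a valid kernel: the Gram matrix is positive semi-definite
print(np.all(np.linalg.eigvalsh(K) >= -1e-10))  # True
```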
The kernel determines not just correlations but also the regularity of sample functions—how smooth, continuous, or differentiable they are.
Continuity: For a zero-mean GP with stationary kernel $k(r)$, sample paths are almost surely continuous if:
$$k(0) - k(r) = O(|\log r|^{-(1+\epsilon)})$$
as $r \to 0$, for some $\epsilon > 0$. All commonly used kernels satisfy this, so GP samples are typically continuous.
Mean-Square Differentiability: A GP is mean-square differentiable if:
$$\lim_{h \to 0} \mathbb{E}\left[\left(\frac{f(x+h) - f(x)}{h} - f'(x)\right)^2\right] = 0$$
This occurs if and only if $\frac{\partial^2 k}{\partial x \partial x'}$ exists and is finite at $x = x'$.
For the RBF kernel: $$\frac{\partial^2}{\partial x \partial x'} \exp\left(-\frac{(x-x')^2}{2\ell^2}\right) \bigg|_{x=x'} = \frac{1}{\ell^2}$$
This is finite; in fact, all higher-order mixed derivatives of the RBF kernel exist and are finite at $x = x'$, so RBF samples are infinitely (mean-square) differentiable!
| Kernel | Smoothness Parameter | Differentiability | Sample Path Character |
|---|---|---|---|
| Exponential (Matérn 1/2) | $\nu = 1/2$ | Continuous, not differentiable | Rough, jagged paths |
| Matérn 3/2 | $\nu = 3/2$ | Once differentiable | Moderately smooth |
| Matérn 5/2 | $\nu = 5/2$ | Twice differentiable | Smooth curves |
| RBF (Squared Exponential) | $\nu = \infty$ | Infinitely differentiable | Ultra-smooth |
| Periodic + RBF | $\nu = \infty$ | Infinitely differentiable | Smooth and periodic |
The RBF kernel produces extremely smooth samples—infinitely differentiable everywhere. This is often unrealistic for physical phenomena. Financial data, weather patterns, and biological signals typically have finite smoothness. The Matérn family provides a smoothness parameter ν that lets you match the realism of your model to the data. Matérn-5/2 is a popular 'sweet spot' for many applications.
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.linalg import cholesky
from scipy.special import kv, gamma

def rbf_kernel(x1, x2, length_scale=1.0):
    """Infinitely differentiable (nu=infinity)"""
    return np.exp(-0.5 * np.subtract.outer(x1, x2)**2 / length_scale**2)

def matern_kernel(x1, x2, nu=2.5, length_scale=1.0):
    """Matérn kernel with smoothness parameter nu"""
    r = np.abs(np.subtract.outer(x1, x2))
    r = np.clip(r, 1e-10, None)  # Avoid division by zero
    if nu == 0.5:  # Exponential
        return np.exp(-r / length_scale)
    elif nu == 1.5:
        sqrt3 = np.sqrt(3)
        return (1 + sqrt3 * r / length_scale) * np.exp(-sqrt3 * r / length_scale)
    elif nu == 2.5:
        sqrt5 = np.sqrt(5)
        return (1 + sqrt5 * r / length_scale + 5 * r**2 / (3 * length_scale**2)) * \
            np.exp(-sqrt5 * r / length_scale)
    else:  # General Matérn (computationally expensive)
        coef = (2**(1 - nu)) / gamma(nu)
        arg = np.sqrt(2 * nu) * r / length_scale
        return coef * (arg**nu) * kv(nu, arg)

# Setup
np.random.seed(42)
x = np.linspace(0, 5, 200)
n = len(x)

kernels = [
    ("Matérn 1/2 (rough)", lambda x1, x2: matern_kernel(x1, x2, nu=0.5)),
    ("Matérn 3/2", lambda x1, x2: matern_kernel(x1, x2, nu=1.5)),
    ("Matérn 5/2", lambda x1, x2: matern_kernel(x1, x2, nu=2.5)),
    ("RBF (ultra-smooth)", rbf_kernel),
]

fig, axes = plt.subplots(2, 2, figsize=(14, 10))
axes = axes.flatten()

for idx, (name, kernel_fn) in enumerate(kernels):
    K = kernel_fn(x, x) + 1e-8 * np.eye(n)
    L = cholesky(K, lower=True)
    ax = axes[idx]
    for i in range(3):
        sample = L @ np.random.randn(n)
        ax.plot(x, sample, alpha=0.8, linewidth=1.5)
    ax.set_xlabel('x', fontsize=12)
    ax.set_ylabel('f(x)', fontsize=12)
    ax.set_title(f'{name}', fontsize=14)
    ax.grid(True, alpha=0.3)
    ax.set_ylim([-3, 3])

plt.suptitle('GP Sample Paths: Smoothness Depends on Kernel', fontsize=16)
plt.tight_layout()
plt.show()

print("Notice how sample roughness increases as nu decreases.")
print("The kernel completely determines sample path regularity!")
```

In practice, most GP models use a zero mean function: $m(\mathbf{x}) = 0$. This might seem restrictive—surely real functions aren't centered at zero everywhere?—but there are good reasons for this choice.
Why Zero Mean Works:
Data centering: If we subtract the empirical mean from observations, the residuals have approximately zero mean. The GP then models these centered residuals.
Posterior adaptation: The GP posterior mean adapts to data, so even with zero prior mean, predictions are non-zero where observations inform us.
Far-field behavior: Where data is sparse, predictions revert to the prior mean. If you want predictions to be zero far from data (rather than some arbitrary constant), zero mean is appropriate.
Fewer hyperparameters: Adding a parametric mean function introduces more parameters to optimize. Zero mean keeps the model simpler.
When Non-Zero Mean is Appropriate:
Known trends: If physics or domain knowledge suggests a specific trend (linear growth, exponential decay), incorporate it as the mean function.
Extrapolation: Far from data, the posterior reverts to the prior mean. If you want specific extrapolation behavior, encode it in $m(\mathbf{x})$.
Hierarchical models: In some cases, the mean function itself is uncertain and given a prior.
The General Formulation:
$$f(\mathbf{x}) = m(\mathbf{x}) + g(\mathbf{x}), \quad g \sim \mathcal{GP}(0, k)$$
where $m(\mathbf{x})$ is a fixed (or parameterized) mean function and $g$ is a zero-mean GP. This decomposition separates the deterministic structure we are confident about (captured by $m$) from the stochastic variation the data must reveal (captured by $g$).
Start with a zero-mean GP after centering your data. If model fit is poor or extrapolation is unrealistic, consider adding a parametric mean (linear, polynomial, or domain-specific). The mean function lets you inject structural knowledge; the GP captures everything else.
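The extrapolation point above is easy to demonstrate. A minimal sketch (the `posterior_mean` helper, inputs, and the choice of a linear trend are my own illustration): far from the data, the posterior mean reverts to whatever prior mean is used, so a zero mean predicts $0$ while a linear mean follows the trend:

```python
import numpy as np

def rbf(xa, xb, ell=1.0):
    return np.exp(-0.5 * np.subtract.outer(xa, xb)**2 / ell**2)

def posterior_mean(X, y, X_star, m, sigma_n2=0.01):
    """GP posterior mean with explicit mean function m (f = m + g, g ~ GP(0, k))."""
    K = rbf(X, X) + sigma_n2 * np.eye(len(X))
    K_star = rbf(X_star, X)
    return m(X_star) + K_star @ np.linalg.solve(K, y - m(X))

X = np.array([0.0, 0.5, 1.0])
y = 2.0 * X + 0.1        # data with a clear linear trend

far = np.array([10.0])   # far outside the data range

# Zero prior mean: prediction reverts to 0 far from data
mu0 = posterior_mean(X, y, far, m=lambda x: np.zeros_like(x))
# Linear prior mean m(x) = 2x: extrapolation follows the trend
mu_lin = posterior_mean(X, y, far, m=lambda x: 2.0 * x)

print(mu0, mu_lin)  # mu0 near 0, mu_lin near 20
```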
GP literature uses consistent notation that's worth memorizing. Here's the standard convention:
Training Data:

- $\mathbf{X} = \{\mathbf{x}_1, \ldots, \mathbf{x}_n\}$: training inputs
- $\mathbf{y} = [y_1, \ldots, y_n]^\top$: noisy training targets
- $\mathbf{f} = [f(\mathbf{x}_1), \ldots, f(\mathbf{x}_n)]^\top$: latent function values at the training inputs

Test Data:

- $\mathbf{X}_*$: test inputs
- $\mathbf{f}_*$: latent function values at the test inputs

Kernel/Covariance Matrices:

- $\mathbf{K} = k(\mathbf{X}, \mathbf{X})$: training covariance ($n \times n$)
- $\mathbf{K}_* = k(\mathbf{X}_*, \mathbf{X})$: test–train cross-covariance
- $\mathbf{K}_{**} = k(\mathbf{X}_*, \mathbf{X}_*)$: test covariance
| Symbol | Dimension | Description |
|---|---|---|
| $f$ | function | Latent function (GP distributed) |
| $y$ | scalar/vector | Noisy observation(s) |
| $\sigma_n^2$ | scalar | Observation noise variance |
| $m(\mathbf{x})$ | function → scalar | Mean function |
| $k(\mathbf{x}, \mathbf{x}')$ | function → scalar | Covariance function (kernel) |
| $\boldsymbol{\theta}$ | vector | Kernel hyperparameters |
| $\mathbf{K}$ | $n \times n$ | Gram matrix $[k(\mathbf{x}_i, \mathbf{x}_j)]$ |
| $\mathbf{K}_y$ | $n \times n$ | $\mathbf{K} + \sigma_n^2 \mathbf{I}$ (with noise) |
| $\boldsymbol{\alpha}$ | $n \times 1$ | $\mathbf{K}_y^{-1} \mathbf{y}$ (precomputed for efficiency) |
The Standard GP Prediction Equations:
Given training data $(\mathbf{X}, \mathbf{y})$ and test points $\mathbf{X}_*$, with observation model $y = f(\mathbf{x}) + \epsilon$, $\epsilon \sim \mathcal{N}(0, \sigma_n^2)$:
Posterior Mean: $$\bar{\mathbf{f}}_* = \mathbf{K}_* (\mathbf{K} + \sigma_n^2 \mathbf{I})^{-1} \mathbf{y} = \mathbf{K}_* \boldsymbol{\alpha}$$
Posterior Covariance: $$\text{Cov}(\mathbf{f}_*) = \mathbf{K}_{**} - \mathbf{K}_* (\mathbf{K} + \sigma_n^2 \mathbf{I})^{-1} \mathbf{K}_*^\top$$
These are the equations you'll use for virtually every GP application. They're derived from Gaussian conditioning (which we'll detail when discussing GP regression).
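The prediction equations fit in a few lines of NumPy. A minimal sketch (the toy data, the RBF kernel, and the hyperparameter values are illustrative assumptions), including the $\boldsymbol{\alpha} = \mathbf{K}_y^{-1}\mathbf{y}$ precomputation from the notation table:

```python
import numpy as np

def rbf(xa, xb, ell=1.0):
    return np.exp(-0.5 * np.subtract.outer(xa, xb)**2 / ell**2)

# Training data and test points
X = np.array([0.0, 1.0, 2.0, 3.0])
y = np.sin(X)
X_star = np.array([0.5, 1.5, 2.5])
sigma_n2 = 1e-4

K = rbf(X, X)
K_y = K + sigma_n2 * np.eye(len(X))  # K + sigma_n^2 I
K_star = rbf(X_star, X)              # n_* x n cross-covariance
K_ss = rbf(X_star, X_star)

# Precompute alpha = K_y^{-1} y once; reuse for any number of test points
alpha = np.linalg.solve(K_y, y)

f_bar = K_star @ alpha                               # posterior mean
cov = K_ss - K_star @ np.linalg.solve(K_y, K_star.T)  # posterior covariance

print(f_bar)              # close to sin at the test points
print(np.diag(cov) >= 0)  # posterior variances are non-negative
```

Caching $\boldsymbol{\alpha}$ is what makes repeated prediction cheap: the $O(n^3)$ solve happens once, and each new test point costs only an $O(n)$ dot product for its mean.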
The GP definition relies on deep results from probability theory that guarantee our constructions are mathematically valid.
The Kolmogorov Extension Theorem:
Let $\{P_{\mathbf{x}_1, \ldots, \mathbf{x}_n}\}$ be a family of probability distributions on $\mathbb{R}^n$, indexed by finite subsets of a set $\mathcal{X}$. If this family satisfies two consistency conditions:

- Permutation consistency: reordering the inputs $\mathbf{x}_1, \ldots, \mathbf{x}_n$ permutes the coordinates of the corresponding distribution accordingly
- Marginalization consistency: marginalizing $P_{\mathbf{x}_1, \ldots, \mathbf{x}_n}$ over the last coordinate recovers $P_{\mathbf{x}_1, \ldots, \mathbf{x}_{n-1}}$
Then there exists a unique probability measure on $\mathbb{R}^{\mathcal{X}}$ (the space of all functions from $\mathcal{X}$ to $\mathbb{R}$) whose finite-dimensional marginals are exactly the given distributions.
For GPs: We specify multivariate Gaussians for all finite subsets, determined by $m$ and $k$. Gaussians automatically satisfy consistency (marginals of Gaussians are Gaussian with the right parameters). Therefore, a unique GP exists!
Implications for Practitioners:
Existence is guaranteed: If $k$ is a valid kernel (positive semi-definite), a GP with that kernel exists.
Uniqueness is guaranteed: Any two GPs with the same $m$ and $k$ are the same stochastic process.
Finite computation suffices: We never need to 'construct' the infinite-dimensional process. Finite-dimensional operations with consistent specifications are enough.
The kernel determines everything: Given a valid kernel, all properties of the GP (sample smoothness, long-range behavior, etc.) are fully determined.
What can go wrong:

- If the proposed covariance function is not positive semi-definite, some finite-dimensional 'covariance matrices' will have negative eigenvalues, the consistency conditions cannot be satisfied, and no GP exists.
- The theorem guarantees a process with the right finite-dimensional distributions, but by itself it says nothing about sample-path properties such as continuity; those require additional conditions on the kernel, as discussed above.
Kolmogorov's theorem is what makes GP theory rigorous. Without it, 'distribution over functions' would be hand-waving. With it, we have precise mathematical objects with well-defined properties. The theorem ensures that our finite-dimensional intuitions (sampling, conditioning, marginalization) extend coherently to the full infinite-dimensional setting.
Gaussian Processes are one family in a broader landscape of stochastic processes. Understanding what makes GPs special clarifies their strengths and limitations.
Why 'Gaussian'?
The choice of Gaussian distributions isn't arbitrary—it provides unique computational advantages:
Closure under conditioning: Conditioning a Gaussian on observed variables yields another Gaussian. This makes posterior inference exact and tractable.
Closure under marginalization: Marginalizing out variables from a Gaussian yields a Gaussian. We can ignore unobserved locations without approximation.
Closure under linear operations: Sum, difference, integral, derivative of Gaussians are Gaussian. This enables modeling of complex physical relationships.
Maximum entropy: Among distributions with specified mean and variance, the Gaussian has maximum entropy. It's the 'least informative' choice given first and second moments—a form of principled agnosticism.
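The maximum-entropy claim can be illustrated with a quick calculation (my own example, using the standard differential-entropy formulas): for a fixed variance, the Gaussian's entropy exceeds that of a uniform distribution with the same variance:

```python
import numpy as np

sigma2 = 2.0

# Differential entropy of N(0, sigma^2): 0.5 * log(2 * pi * e * sigma^2)
h_gauss = 0.5 * np.log(2 * np.pi * np.e * sigma2)

# Uniform on [-a, a] has variance a^2 / 3; match it: a = sqrt(3 * sigma^2)
a = np.sqrt(3 * sigma2)
h_unif = np.log(2 * a)  # entropy of Uniform(-a, a) is log(b - a)

print(h_gauss > h_unif)  # True: the Gaussian maximizes entropy for fixed variance
```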
Comparison with Other Processes:
| Process | Finite-Dim Distributions | Posterior Inference | Typical Use |
|---|---|---|---|
| Gaussian Process | Multivariate Gaussian | Exact (closed form) | Regression, optimization |
| Wiener Process | Gaussian (special case) | Exact | Brownian motion modeling |
| Poisson Process | Poisson | Various | Count data, event modeling |
| Cox Process | Poisson with random rate | Often intractable | Spatial point patterns |
| Student-t Process | Multivariate t | Exact but heavier tails | Robust regression |
| Dirichlet Process | Dirichlet | MCMC typically | Clustering, density estimation |
Key GP Advantages:

- Exact, closed-form posterior inference and marginal likelihood
- Calibrated uncertainty estimates alongside every prediction
- Flexible, interpretable priors through kernel choice and composition
- Strong performance in small-data regimes

Key GP Limitations:

- $O(n^3)$ training cost and $O(n^2)$ memory from the Gram matrix
- Exact inference requires a Gaussian likelihood; classification and count data need approximations
- Results are sensitive to the choice of kernel and its hyperparameters
GPs are ideal when: (1) you need uncertainty quantification, not just predictions; (2) data is limited and you can't afford overconfident extrapolation; (3) function properties like smoothness are known or can be learned; (4) the problem involves continuous inputs and outputs. For other scenarios, consider extensions (sparse GPs, non-Gaussian likelihoods) or alternative models.
We've now established the formal definition of Gaussian Processes and explored its mathematical implications. Here are the core takeaways:

- A GP is a collection of random variables, any finite number of which are jointly Gaussian.
- A GP is completely specified by a mean function $m$ and a covariance function (kernel) $k$: $f \sim \mathcal{GP}(m, k)$.
- Any finite set of inputs yields $\mathbf{f} \sim \mathcal{N}(\boldsymbol{\mu}, \mathbf{K})$, with $\boldsymbol{\mu}$ and $\mathbf{K}$ built by evaluating $m$ and $k$.
- Marginalization, conditioning, and linear operations preserve Gaussianity, which is what makes GP inference tractable.
- The kernel determines sample-path properties: stationarity, isotropy, and smoothness.
- The Kolmogorov extension theorem guarantees that any valid (positive semi-definite) kernel defines a unique GP.
With the GP definition firmly established, we're ready to explore the critical choice in any GP model: the mean and covariance functions. The next page dives deep into common kernels, their properties, and how to choose and combine them to encode your prior beliefs about the function you're trying to learn.
You now have rigorous understanding of the Gaussian Process definition and its mathematical properties. This foundational knowledge underlies all GP algorithms and applications. Proceed to learn about the mean and covariance functions that give GPs their expressive power.