Having developed intuition from both function space and weight space perspectives, we now state the formal definition of a Gaussian Process. This definition is mathematically precise yet beautifully simple—it captures everything essential about GPs in a single statement.
Understanding this definition deeply is crucial: every GP computation, every algorithm, every application ultimately derives from this foundational statement.
A Gaussian Process is a collection of random variables, any finite number of which have a joint Gaussian distribution.
— Rasmussen & Williams, Gaussian Processes for Machine Learning (2006)
This definition is deceptively compact. Let's unpack each component:
'Collection of random variables': The random variables are the function values $\{f(\mathbf{x}) : \mathbf{x} \in \mathcal{X}\}$ for all inputs in some domain $\mathcal{X}$. There are potentially uncountably many such variables (one for every possible input).
'Any finite number': We can pick any finite subset of inputs $\{\mathbf{x}_1, \ldots, \mathbf{x}_n\}$ and consider the corresponding function values $\{f(\mathbf{x}_1), \ldots, f(\mathbf{x}_n)\}$.
'Joint Gaussian distribution': The vector $[f(\mathbf{x}_1), \ldots, f(\mathbf{x}_n)]^\top$ follows a multivariate Gaussian distribution for any choice of inputs and any $n$.
By the end of this page, you will completely understand the GP definition, its mathematical implications, the notation conventions, and how this definition enables tractable computation over infinite-dimensional function spaces. You'll master the specification of GPs through their mean and covariance functions.
A GP is completely specified by two functions: a mean function and a covariance function (also called the kernel).
Mean Function $m: \mathcal{X} \to \mathbb{R}$: $$m(\mathbf{x}) = \mathbb{E}[f(\mathbf{x})]$$
This gives the expected value of the function at any input. It represents our prior belief about the 'central' function before seeing data.
Covariance Function (Kernel) $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$: $$k(\mathbf{x}, \mathbf{x}') = \text{Cov}[f(\mathbf{x}), f(\mathbf{x}')] = \mathbb{E}[(f(\mathbf{x}) - m(\mathbf{x}))(f(\mathbf{x}') - m(\mathbf{x}'))]$$
This gives the covariance between function values at any two inputs. It encodes our prior beliefs about function properties: smoothness, periodicity, variation scale.
Standard Notation: We write: $$f \sim \mathcal{GP}(m, k)$$
to denote that $f$ is distributed as a Gaussian Process with mean function $m$ and covariance function $k$.
To fully define a GP model, you need only specify m(x) and k(x, x'). Everything else—priors, posteriors, predictions, uncertainty—follows from manipulating multivariate Gaussians. The art of GP modeling is choosing m and k to reflect your prior beliefs about the function you're trying to learn.
Given a GP $f \sim \mathcal{GP}(m, k)$ and any finite set of inputs $\mathbf{X} = \{\mathbf{x}_1, \ldots, \mathbf{x}_n\}$, the GP definition tells us:
$$\mathbf{f} = \begin{bmatrix} f(\mathbf{x}_1) \\ f(\mathbf{x}_2) \\ \vdots \\ f(\mathbf{x}_n) \end{bmatrix} \sim \mathcal{N}(\boldsymbol{\mu}, \mathbf{K})$$
where:
Mean vector: $$\boldsymbol{\mu} = \begin{bmatrix} m(\mathbf{x}_1) \\ m(\mathbf{x}_2) \\ \vdots \\ m(\mathbf{x}_n) \end{bmatrix}$$
Covariance matrix (Gram matrix): $$\mathbf{K} = \begin{bmatrix} k(\mathbf{x}_1, \mathbf{x}_1) & k(\mathbf{x}_1, \mathbf{x}_2) & \cdots & k(\mathbf{x}_1, \mathbf{x}_n) \\ k(\mathbf{x}_2, \mathbf{x}_1) & k(\mathbf{x}_2, \mathbf{x}_2) & \cdots & k(\mathbf{x}_2, \mathbf{x}_n) \\ \vdots & \vdots & \ddots & \vdots \\ k(\mathbf{x}_n, \mathbf{x}_1) & k(\mathbf{x}_n, \mathbf{x}_2) & \cdots & k(\mathbf{x}_n, \mathbf{x}_n) \end{bmatrix}$$
Properties of the Gram Matrix:
Symmetric: $k(\mathbf{x}, \mathbf{x}') = k(\mathbf{x}', \mathbf{x})$, so $\mathbf{K} = \mathbf{K}^\top$
Positive semi-definite: For any vector $\mathbf{a}$, $\mathbf{a}^\top \mathbf{K} \mathbf{a} \geq 0$. This is required for $\mathbf{K}$ to be a valid covariance matrix.
Diagonal gives variances: $K_{ii} = k(\mathbf{x}_i, \mathbf{x}_i) = \text{Var}[f(\mathbf{x}_i)]$
Off-diagonal gives covariances: $K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)$ measures how strongly function values at different inputs covary
Compact Notation: $$\mathbf{K} = k(\mathbf{X}, \mathbf{X})$$
where $k(\mathbf{X}, \mathbf{X}')$ denotes the matrix with entries $[k(\mathbf{X}, \mathbf{X}')]_{ij} = k(\mathbf{x}_i, \mathbf{x}_j')$.
```python
import numpy as np
from scipy.linalg import cholesky
import matplotlib.pyplot as plt

# Define a Gaussian Process
class GaussianProcess:
    """
    A Gaussian Process specified by mean and covariance functions.

    f ~ GP(m, k)

    For any finite set of inputs X, f(X) ~ N(m(X), k(X, X))
    """

    def __init__(self, mean_fn, kernel_fn):
        """
        Args:
            mean_fn: m(x) -> scalar mean at input x
            kernel_fn: k(x1, x2) -> covariance between inputs
        """
        self.mean_fn = mean_fn
        self.kernel_fn = kernel_fn

    def mean_vector(self, X):
        """Compute mean vector μ = [m(x₁), ..., m(xₙ)]ᵀ"""
        return np.array([self.mean_fn(x) for x in X])

    def gram_matrix(self, X1, X2=None):
        """
        Compute Gram matrix K with entries K[i,j] = k(X1[i], X2[j])

        If X2 is None, computes k(X1, X1)
        """
        if X2 is None:
            X2 = X1
        n1, n2 = len(X1), len(X2)
        K = np.zeros((n1, n2))
        for i in range(n1):
            for j in range(n2):
                K[i, j] = self.kernel_fn(X1[i], X2[j])
        return K

    def sample(self, X, n_samples=1):
        """
        Sample functions from GP at inputs X.

        Returns:
            (n_samples, len(X)) array of function values
        """
        mu = self.mean_vector(X)
        K = self.gram_matrix(X) + 1e-8 * np.eye(len(X))  # jitter for stability
        L = cholesky(K, lower=True)
        samples = []
        for _ in range(n_samples):
            z = np.random.randn(len(X))
            f = mu + L @ z  # f ~ N(μ, K)
            samples.append(f)
        return np.array(samples)

# Example: GP with zero mean and RBF kernel
def zero_mean(x):
    return 0.0

def rbf_kernel(x1, x2, length_scale=1.0, variance=1.0):
    return variance * np.exp(-0.5 * (x1 - x2)**2 / length_scale**2)

# Create GP
gp = GaussianProcess(
    mean_fn=zero_mean,
    kernel_fn=lambda x1, x2: rbf_kernel(x1, x2, length_scale=1.0, variance=1.0)
)

# Sample at finite points
X = np.linspace(0, 5, 100)
samples = gp.sample(X, n_samples=5)

# Visualize
plt.figure(figsize=(12, 5))
for i, sample in enumerate(samples):
    plt.plot(X, sample, label=f'Sample {i+1}')
plt.xlabel('x')
plt.ylabel('f(x)')
plt.title('GP Definition: Samples from f ~ GP(0, k_RBF)')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

print("Each sample is drawn from the n-dimensional Gaussian N(μ, K)")
print(f"Here n = {len(X)}, so we're sampling from a {len(X)}-dim Gaussian.")
```

The GP definition implies several fundamental properties that make GPs analytically tractable and practically powerful.
Property 1: Marginalization
If $f \sim \mathcal{GP}(m, k)$, then for any subset of inputs $\mathbf{X}_A \subset \mathbf{X}$:
$$f(\mathbf{X}_A) \sim \mathcal{N}(m(\mathbf{X}_A), k(\mathbf{X}_A, \mathbf{X}_A))$$
This follows from the marginalization property of multivariate Gaussians: dropping variables just means dropping the corresponding rows/columns from mean vector and covariance matrix.
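This "drop rows and columns" operation can be checked directly. The sketch below (a minimal example; the RBF helper and variable names are mine, not from the text) confirms that the Gram matrix of a subset equals the corresponding submatrix of the full Gram matrix:

```python
import numpy as np

def rbf(x1, x2, ell=1.0):
    """RBF kernel between two scalar inputs."""
    return np.exp(-0.5 * (x1 - x2)**2 / ell**2)

# Full set of inputs and its Gram matrix
X = np.array([0.0, 0.5, 1.0, 2.0, 3.0])
K = np.array([[rbf(a, b) for b in X] for a in X])

# Marginalizing onto the subset at indices {0, 2, 3} just selects
# the corresponding rows and columns of K
idx = [0, 2, 3]
K_A = K[np.ix_(idx, idx)]

# Same result as building the Gram matrix directly on the subset
K_direct = np.array([[rbf(a, b) for b in X[idx]] for a in X[idx]])
print(np.allclose(K_A, K_direct))  # True
```

No recomputation or approximation is needed: the marginal distribution over the subset is exactly $\mathcal{N}(\boldsymbol{\mu}_A, \mathbf{K}_A)$.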
Property 2: Conditioning
Given observations $\mathbf{f}_A = f(\mathbf{X}_A)$, the conditional distribution of $\mathbf{f}_B = f(\mathbf{X}_B)$ is:
$$\mathbf{f}_B \mid \mathbf{f}_A \sim \mathcal{N}(\boldsymbol{\mu}_{B|A}, \mathbf{K}_{B|A})$$
where: $$\boldsymbol{\mu}_{B|A} = m(\mathbf{X}_B) + \mathbf{K}_{BA} \mathbf{K}_{AA}^{-1} (\mathbf{f}_A - m(\mathbf{X}_A))$$ $$\mathbf{K}_{B|A} = \mathbf{K}_{BB} - \mathbf{K}_{BA} \mathbf{K}_{AA}^{-1} \mathbf{K}_{AB}$$
This is the GP posterior—the updated distribution after observing data.
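The conditioning formulas translate directly into a few lines of linear algebra. A minimal sketch (the RBF kernel choice, inputs, and names are illustrative assumptions) conditions a zero-mean GP on three observed values:

```python
import numpy as np

def rbf(xa, xb, ell=1.0):
    """RBF Gram matrix between two 1-D input arrays."""
    return np.exp(-0.5 * np.subtract.outer(xa, xb)**2 / ell**2)

# Observed inputs/values (set A) and query inputs (set B); zero prior mean
X_A = np.array([0.0, 1.0, 2.0])
f_A = np.array([0.5, -0.3, 0.8])
X_B = np.array([0.5, 1.5])

K_AA = rbf(X_A, X_A) + 1e-8 * np.eye(len(X_A))  # jitter for stability
K_BA = rbf(X_B, X_A)
K_BB = rbf(X_B, X_B)

# Conditioning formulas with m = 0
mu_cond = K_BA @ np.linalg.solve(K_AA, f_A)
K_cond = K_BB - K_BA @ np.linalg.solve(K_AA, K_BA.T)

# Observing f_A can only shrink our uncertainty about f_B
print(np.all(np.diag(K_cond) <= np.diag(K_BB) + 1e-10))  # True
```

Note that the posterior covariance $\mathbf{K}_{B|A}$ never exceeds the prior covariance $\mathbf{K}_{BB}$ on the diagonal: conditioning reduces uncertainty.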
Property 3: Linear Operations Preserve Gaussianity
If $f \sim \mathcal{GP}(m, k)$ and $g(\mathbf{x}) = \int w(\mathbf{x}, \mathbf{z}) f(\mathbf{z}) d\mathbf{z}$ is a linear functional of $f$, then $g$ is also a GP:
$$g \sim \mathcal{GP}(m_g, k_g)$$
with: $$m_g(\mathbf{x}) = \int w(\mathbf{x}, \mathbf{z}) m(\mathbf{z}) d\mathbf{z}$$ $$k_g(\mathbf{x}, \mathbf{x}') = \iint w(\mathbf{x}, \mathbf{z}) k(\mathbf{z}, \mathbf{z}') w(\mathbf{x}', \mathbf{z}') d\mathbf{z} d\mathbf{z}'$$
Important special case: Derivatives
If $f \sim \mathcal{GP}(m, k)$ has sufficiently smooth sample paths, then:
$$\frac{\partial f}{\partial x_i} \sim \mathcal{GP}\left(\frac{\partial m}{\partial x_i}, \frac{\partial^2 k}{\partial x_i \partial x_i'}\right)$$
Derivatives of GPs are also GPs! This enables modeling of physical systems with gradient observations.
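The derivative-kernel formula above can be sanity-checked numerically. The sketch below (my own check, not from the text) approximates $\partial^2 k / \partial x \, \partial x'$ of the RBF kernel at $x = x'$ by central finite differences and compares it with the analytic value $1/\ell^2$ quoted in the next subsection:

```python
import numpy as np

def k_rbf(x, xp, ell=0.7):
    """Squared exponential kernel k(x, x')."""
    return np.exp(-0.5 * (x - xp)**2 / ell**2)

ell = 0.7
h = 1e-3

# Mixed partial d^2 k / (dx dx') at x = x' = 0 via central finite differences
fd = (k_rbf(h, h, ell) - k_rbf(h, -h, ell)
      - k_rbf(-h, h, ell) + k_rbf(-h, -h, ell)) / (4 * h**2)

print(fd, 1 / ell**2)  # both approximately 2.0408
```

The finite-difference estimate agrees with $1/\ell^2$ to several decimal places, confirming that the derivative process has finite variance.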
| Property | Statement | Use Case |
|---|---|---|
| Marginalization | Subset distributions are Gaussian | Ignore unobserved locations freely |
| Conditioning | Conditional distributions are Gaussian | Posterior inference given observations |
| Linear Functionals | Linear operations yield GPs | Integrals, derivatives, convolutions |
| Affine Transformation | $af + b$ is GP if $f$ is GP | Scaling and shifting predictions |
| Sum of Independent | $f + g$ is GP if $f, g$ independent GPs | Combining signal and noise models |
These properties are why GPs are computationally tractable. Unlike arbitrary distributions over functions, Gaussians have closed-form marginals, conditionals, and linear transformations. Every GP calculation reduces to multivariate Gaussian algebra—matrix multiplications and inversions—which we know how to compute efficiently.
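The sum-of-independent-GPs row in the table can be verified by Monte Carlo. A minimal sketch (kernel choices, scales, and sample count are my own illustrative assumptions): draw independent samples of $f \sim \mathcal{GP}(0, k_1)$ and $g \sim \mathcal{GP}(0, k_2)$ at a few points and check that the empirical covariance of $f + g$ matches $\mathbf{K}_1 + \mathbf{K}_2$:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.array([0.0, 0.5, 1.0, 1.5])

def rbf(xa, ell):
    """RBF Gram matrix of a 1-D input array with itself."""
    return np.exp(-0.5 * np.subtract.outer(xa, xa)**2 / ell**2)

K1 = rbf(x, ell=1.0)        # long-scale "signal" kernel
K2 = 0.3 * rbf(x, ell=0.2)  # short-scale "noise-like" kernel

L1 = np.linalg.cholesky(K1 + 1e-10 * np.eye(4))
L2 = np.linalg.cholesky(K2 + 1e-10 * np.eye(4))

# Independent samples of f ~ GP(0, k1) and g ~ GP(0, k2) at x
n = 50_000
f = L1 @ rng.standard_normal((4, n))
g = L2 @ rng.standard_normal((4, n))

# Empirical covariance of f + g should match K1 + K2
emp = np.cov(f + g)
print(np.max(np.abs(emp - (K1 + K2))))  # small (Monte Carlo error)
```

This is exactly the decomposition used when combining a smooth signal model with a rough noise model: the sum is again a GP whose kernel is the sum of the two kernels.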
Many commonly used kernels have special structure that simplifies analysis and interpretation.
Stationary Kernels:
A kernel is stationary if it depends only on the difference between inputs:
$$k(\mathbf{x}, \mathbf{x}') = k(\mathbf{x} - \mathbf{x}') = k(\boldsymbol{\tau})$$
where $\boldsymbol{\tau} = \mathbf{x} - \mathbf{x}'$ is the displacement vector.
Physical interpretation: The statistical properties of the function (variance, correlations) don't change as we 'shift' through input space. The function looks statistically similar at $x = 0$, $x = 100$, or anywhere else.
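Shift invariance is easy to test numerically. A small sketch (helper names and values are mine): shifting both inputs by the same amount leaves a stationary kernel unchanged, while a non-stationary kernel such as the linear kernel changes:

```python
import numpy as np

def rbf(x, xp, ell=1.0):
    """Stationary: depends only on x - x'."""
    return np.exp(-0.5 * (x - xp)**2 / ell**2)

def linear(x, xp, sb2=1.0, sv2=1.0):
    """Non-stationary: depends on the inputs themselves."""
    return sb2 + sv2 * x * xp

x, xp, shift = 0.3, 1.7, 100.0

# Stationary kernel: shifting both inputs leaves the covariance unchanged
print(np.isclose(rbf(x + shift, xp + shift), rbf(x, xp)))        # True
# Linear kernel: the same shift changes the covariance dramatically
print(np.isclose(linear(x + shift, xp + shift), linear(x, xp)))  # False
```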
Isotropic (Radial) Kernels:
A kernel is isotropic if it depends only on the distance between inputs:
$$k(\mathbf{x}, \mathbf{x}') = k(\|\mathbf{x} - \mathbf{x}'\|) = k(r)$$
where $r = \|\mathbf{x} - \mathbf{x}'\|$ is the Euclidean distance.
Physical interpretation: Correlations depend only on how far apart points are, not their direction. The function has no preferred orientation.
Non-Stationary Kernels:
Sometimes stationarity is inappropriate—the function may behave differently in different regions. Examples:
Linear Kernel: $k(\mathbf{x}, \mathbf{x}') = \sigma_b^2 + \sigma_v^2 \mathbf{x}^\top \mathbf{x}'$
Neural Network Kernel: Derived from infinite-width neural networks
Input-Dependent Length Scales (Gibbs kernel): $k(x, x') = \sqrt{\dfrac{2\,\ell(x)\,\ell(x')}{\ell(x)^2 + \ell(x')^2}} \exp\left(-\dfrac{(x - x')^2}{\ell(x)^2 + \ell(x')^2}\right)$, where the length scale $\ell(x)$ varies across the input space
Use stationary kernels as a default—they're simpler, have fewer parameters, and work well for many problems. Switch to non-stationary kernels when you have evidence that function properties genuinely vary across the input domain: trend lines that grow over time, signals that become more variable in certain regions, or physical systems with spatially varying characteristics.
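To see non-stationarity concretely, consider the linear kernel from the list above. A short sketch (parameter values are illustrative): the prior variance $\mathrm{Var}[f(x)] = \sigma_b^2 + \sigma_v^2 x^2$ grows with $|x|$, yet the Gram matrix remains positive semi-definite, so it is still a valid kernel:

```python
import numpy as np

def linear_kernel(x1, x2, sigma_b2=1.0, sigma_v2=0.5):
    """k(x, x') = sigma_b^2 + sigma_v^2 * x * x' (non-stationary)."""
    return sigma_b2 + sigma_v2 * np.outer(x1, x2)

x = np.linspace(-3, 3, 7)
K = linear_kernel(x, x)

# Prior variance sigma_b^2 + sigma_v^2 x^2 grows with |x| --
# the hallmark of non-stationarity
print(np.diag(K))  # largest at x = +/-3, smallest (= sigma_b^2) at x = 0

# Still a valid kernel: the Gram matrix is positive semi-definite
print(np.all(np.linalg.eigvalsh(K) >= -1e-10))  # True
```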
The kernel determines not just correlations but also the regularity of sample functions—how smooth, continuous, or differentiable they are.
Continuity: For a zero-mean GP with stationary kernel $k(r)$, sample paths are almost surely continuous if:
$$k(0) - k(r) = O(|\log r|^{-(1+\epsilon)})$$
as $r \to 0$, for some $\epsilon > 0$. All commonly used kernels satisfy this, so GP samples are typically continuous.
Mean-Square Differentiability: A GP is mean-square differentiable if:
$$\lim_{h \to 0} \mathbb{E}\left[\left(\frac{f(x+h) - f(x)}{h} - f'(x)\right)^2\right] = 0$$
This occurs if and only if $\frac{\partial^2 k}{\partial x \partial x'}$ exists and is finite at $x = x'$.
For the RBF kernel: $$\frac{\partial^2}{\partial x \partial x'} \exp\left(-\frac{(x-x')^2}{2\ell^2}\right) \bigg|_{x=x'} = \frac{1}{\ell^2}$$
This is finite; in fact, all higher-order mixed derivatives of the RBF kernel exist and are finite at $x = x'$, so RBF samples are infinitely (mean-square) differentiable!
| Kernel | Smoothness Parameter | Differentiability | Sample Path Character |
|---|---|---|---|
| Exponential (Matérn 1/2) | $\nu = 1/2$ | Continuous, not differentiable | Rough, jagged paths |
| Matérn 3/2 | $\nu = 3/2$ | Once differentiable | Moderately smooth |
| Matérn 5/2 | $\nu = 5/2$ | Twice differentiable | Smooth curves |
| RBF (Squared Exponential) | $\nu = \infty$ | Infinitely differentiable | Ultra-smooth |
| Periodic + RBF | $\nu = \infty$ | Infinitely differentiable | Smooth and periodic |
The RBF kernel produces extremely smooth samples—infinitely differentiable everywhere. This is often unrealistic for physical phenomena. Financial data, weather patterns, and biological signals typically have finite smoothness. The Matérn family provides a smoothness parameter ν that lets you match the realism of your model to the data. Matérn-5/2 is a popular 'sweet spot' for many applications.
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.linalg import cholesky
from scipy.special import kv, gamma

def rbf_kernel(x1, x2, length_scale=1.0):
    """Infinitely differentiable (nu=infinity)"""
    return np.exp(-0.5 * np.subtract.outer(x1, x2)**2 / length_scale**2)

def matern_kernel(x1, x2, nu=2.5, length_scale=1.0):
    """Matérn kernel with smoothness parameter nu"""
    r = np.abs(np.subtract.outer(x1, x2))
    r = np.clip(r, 1e-10, None)  # Avoid division by zero
    if nu == 0.5:  # Exponential
        return np.exp(-r / length_scale)
    elif nu == 1.5:
        sqrt3 = np.sqrt(3)
        return (1 + sqrt3 * r / length_scale) * np.exp(-sqrt3 * r / length_scale)
    elif nu == 2.5:
        sqrt5 = np.sqrt(5)
        return (1 + sqrt5 * r / length_scale + 5 * r**2 / (3 * length_scale**2)) * \
            np.exp(-sqrt5 * r / length_scale)
    else:  # General Matérn (computationally expensive)
        coef = (2**(1 - nu)) / gamma(nu)
        arg = np.sqrt(2 * nu) * r / length_scale
        return coef * (arg**nu) * kv(nu, arg)

# Setup
np.random.seed(42)
x = np.linspace(0, 5, 200)
n = len(x)

kernels = [
    ("Matérn 1/2 (rough)", lambda x1, x2: matern_kernel(x1, x2, nu=0.5)),
    ("Matérn 3/2", lambda x1, x2: matern_kernel(x1, x2, nu=1.5)),
    ("Matérn 5/2", lambda x1, x2: matern_kernel(x1, x2, nu=2.5)),
    ("RBF (ultra-smooth)", rbf_kernel),
]

fig, axes = plt.subplots(2, 2, figsize=(14, 10))
axes = axes.flatten()

for idx, (name, kernel_fn) in enumerate(kernels):
    K = kernel_fn(x, x) + 1e-8 * np.eye(n)
    L = cholesky(K, lower=True)
    ax = axes[idx]
    for i in range(3):
        sample = L @ np.random.randn(n)
        ax.plot(x, sample, alpha=0.8, linewidth=1.5)
    ax.set_xlabel('x', fontsize=12)
    ax.set_ylabel('f(x)', fontsize=12)
    ax.set_title(f'{name}', fontsize=14)
    ax.grid(True, alpha=0.3)
    ax.set_ylim([-3, 3])

plt.suptitle('GP Sample Paths: Smoothness Depends on Kernel', fontsize=16)
plt.tight_layout()
plt.show()

print("Notice how sample roughness increases as nu decreases.")
print("The kernel completely determines sample path regularity!")
```

In practice, most GP models use a zero mean function: $m(\mathbf{x}) = 0$. This might seem restrictive—surely real functions aren't centered at zero everywhere?—but there are good reasons for this choice.
Why Zero Mean Works:
Data centering: If we subtract the empirical mean from observations, the residuals have approximately zero mean. The GP then models these centered residuals.
Posterior adaptation: The GP posterior mean adapts to data, so even with zero prior mean, predictions are non-zero where observations inform us.
Far-field behavior: Where data is sparse, predictions revert to the prior mean. If you want predictions to be zero far from data (rather than some arbitrary constant), zero mean is appropriate.
Fewer hyperparameters: Adding a parametric mean function introduces more parameters to optimize. Zero mean keeps the model simpler.
When Non-Zero Mean is Appropriate:
Known trends: If physics or domain knowledge suggests a specific trend (linear growth, exponential decay), incorporate it as the mean function.
Extrapolation: Far from data, the posterior reverts to the prior mean. If you want specific extrapolation behavior, encode it in $m(\mathbf{x})$.
Hierarchical models: In some cases, the mean function itself is uncertain and given a prior.
The General Formulation:
$$f(\mathbf{x}) = m(\mathbf{x}) + g(\mathbf{x}), \quad g \sim \mathcal{GP}(0, k)$$
where $m(\mathbf{x})$ is a fixed (or parameterized) mean function and $g$ is a zero-mean GP. This decomposition separates the deterministic structure we are confident about (captured by $m$) from the stochastic variation the data must reveal (captured by $g$).
Start with a zero-mean GP after centering your data. If model fit is poor or extrapolation is unrealistic, consider adding a parametric mean (linear, polynomial, or domain-specific). The mean function lets you inject structural knowledge; the GP captures everything else.
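The extrapolation point above is easy to demonstrate. A minimal sketch (the `posterior_mean` helper, inputs, and the choice of a linear trend are my own illustration): far from the data, the posterior mean reverts to whatever prior mean is used, so a zero mean predicts $0$ while a linear mean follows the trend:

```python
import numpy as np

def rbf(xa, xb, ell=1.0):
    return np.exp(-0.5 * np.subtract.outer(xa, xb)**2 / ell**2)

def posterior_mean(X, y, X_star, m, sigma_n2=0.01):
    """GP posterior mean with explicit mean function m (f = m + g, g ~ GP(0, k))."""
    K = rbf(X, X) + sigma_n2 * np.eye(len(X))
    K_star = rbf(X_star, X)
    return m(X_star) + K_star @ np.linalg.solve(K, y - m(X))

X = np.array([0.0, 0.5, 1.0])
y = 2.0 * X + 0.1        # data with a clear linear trend

far = np.array([10.0])   # far outside the data range

# Zero prior mean: prediction reverts to 0 far from data
mu0 = posterior_mean(X, y, far, m=lambda x: np.zeros_like(x))
# Linear prior mean m(x) = 2x: extrapolation follows the trend
mu_lin = posterior_mean(X, y, far, m=lambda x: 2.0 * x)

print(mu0, mu_lin)  # mu0 near 0, mu_lin near 20
```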
GP literature uses consistent notation that's worth memorizing. Here's the standard convention:
Training Data:

- $\mathbf{X} = \{\mathbf{x}_1, \ldots, \mathbf{x}_n\}$: training inputs
- $\mathbf{y} = [y_1, \ldots, y_n]^\top$: noisy training targets
- $\mathbf{f} = [f(\mathbf{x}_1), \ldots, f(\mathbf{x}_n)]^\top$: latent function values at the training inputs

Test Data:

- $\mathbf{X}_*$: test inputs
- $\mathbf{f}_*$: latent function values at the test inputs

Kernel/Covariance Matrices:

- $\mathbf{K} = k(\mathbf{X}, \mathbf{X})$: training covariance ($n \times n$)
- $\mathbf{K}_* = k(\mathbf{X}_*, \mathbf{X})$: test–train cross-covariance
- $\mathbf{K}_{**} = k(\mathbf{X}_*, \mathbf{X}_*)$: test covariance
| Symbol | Dimension | Description |
|---|---|---|
| $f$ | function | Latent function (GP distributed) |
| $y$ | scalar/vector | Noisy observation(s) |
| $\sigma_n^2$ | scalar | Observation noise variance |
| $m(\mathbf{x})$ | function → scalar | Mean function |
| $k(\mathbf{x}, \mathbf{x}')$ | function → scalar | Covariance function (kernel) |
| $\boldsymbol{\theta}$ | vector | Kernel hyperparameters |
| $\mathbf{K}$ | $n \times n$ | Gram matrix $[k(\mathbf{x}_i, \mathbf{x}_j)]$ |
| $\mathbf{K}_y$ | $n \times n$ | $\mathbf{K} + \sigma_n^2 \mathbf{I}$ (with noise) |
| $\boldsymbol{\alpha}$ | $n \times 1$ | $\mathbf{K}_y^{-1} \mathbf{y}$ (precomputed for efficiency) |
The Standard GP Prediction Equations:
Given training data $(\mathbf{X}, \mathbf{y})$ and test points $\mathbf{X}_*$, with observation model $y = f(\mathbf{x}) + \epsilon$, $\epsilon \sim \mathcal{N}(0, \sigma_n^2)$:
Posterior Mean: $$\bar{\mathbf{f}}_* = \mathbf{K}_* (\mathbf{K} + \sigma_n^2 \mathbf{I})^{-1} \mathbf{y} = \mathbf{K}_* \boldsymbol{\alpha}$$
Posterior Covariance: $$\text{Cov}(\mathbf{f}_*) = \mathbf{K}_{**} - \mathbf{K}_* (\mathbf{K} + \sigma_n^2 \mathbf{I})^{-1} \mathbf{K}_*^\top$$
These are the equations you'll use for virtually every GP application. They're derived from Gaussian conditioning (which we'll detail when discussing GP regression).
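The prediction equations fit in a few lines of NumPy. A minimal sketch (the toy data, the RBF kernel, and the hyperparameter values are illustrative assumptions), including the $\boldsymbol{\alpha} = \mathbf{K}_y^{-1}\mathbf{y}$ precomputation from the notation table:

```python
import numpy as np

def rbf(xa, xb, ell=1.0):
    return np.exp(-0.5 * np.subtract.outer(xa, xb)**2 / ell**2)

# Training data and test points
X = np.array([0.0, 1.0, 2.0, 3.0])
y = np.sin(X)
X_star = np.array([0.5, 1.5, 2.5])
sigma_n2 = 1e-4

K = rbf(X, X)
K_y = K + sigma_n2 * np.eye(len(X))  # K + sigma_n^2 I
K_star = rbf(X_star, X)              # n_* x n cross-covariance
K_ss = rbf(X_star, X_star)

# Precompute alpha = K_y^{-1} y once; reuse for any number of test points
alpha = np.linalg.solve(K_y, y)

f_bar = K_star @ alpha                               # posterior mean
cov = K_ss - K_star @ np.linalg.solve(K_y, K_star.T)  # posterior covariance

print(f_bar)              # close to sin at the test points
print(np.diag(cov) >= 0)  # posterior variances are non-negative
```

Caching $\boldsymbol{\alpha}$ is what makes repeated prediction cheap: the $O(n^3)$ solve happens once, and each new test point costs only an $O(n)$ dot product for its mean.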
The GP definition relies on deep results from probability theory that guarantee our constructions are mathematically valid.
The Kolmogorov Extension Theorem:
Let $\{P_{\mathbf{x}_1, \ldots, \mathbf{x}_n}\}$ be a family of probability distributions on $\mathbb{R}^n$, indexed by finite subsets of a set $\mathcal{X}$. If this family satisfies two consistency conditions:

- Permutation consistency: reordering the inputs $\mathbf{x}_1, \ldots, \mathbf{x}_n$ permutes the coordinates of the corresponding distribution accordingly
- Marginalization consistency: marginalizing $P_{\mathbf{x}_1, \ldots, \mathbf{x}_n}$ over the last coordinate recovers $P_{\mathbf{x}_1, \ldots, \mathbf{x}_{n-1}}$
Then there exists a unique probability measure on $\mathbb{R}^{\mathcal{X}}$ (the space of all functions from $\mathcal{X}$ to $\mathbb{R}$) whose finite-dimensional marginals are exactly the given distributions.
For GPs: We specify multivariate Gaussians for all finite subsets, determined by $m$ and $k$. Gaussians automatically satisfy consistency (marginals of Gaussians are Gaussian with the right parameters). Therefore, a unique GP exists!
Implications for Practitioners:
Existence is guaranteed: If $k$ is a valid kernel (positive semi-definite), a GP with that kernel exists.
Uniqueness is guaranteed: Any two GPs with the same $m$ and $k$ are the same stochastic process.
Finite computation suffices: We never need to 'construct' the infinite-dimensional process. Finite-dimensional operations with consistent specifications are enough.
The kernel determines everything: Given a valid kernel, all properties of the GP (sample smoothness, long-range behavior, etc.) are fully determined.
What can go wrong:

- If the proposed covariance function is not positive semi-definite, some finite-dimensional 'covariance matrices' will have negative eigenvalues, the consistency conditions cannot be satisfied, and no GP exists.
- The theorem guarantees a process with the right finite-dimensional distributions, but by itself it says nothing about sample-path properties such as continuity; those require additional conditions on the kernel, as discussed above.
Kolmogorov's theorem is what makes GP theory rigorous. Without it, 'distribution over functions' would be hand-waving. With it, we have precise mathematical objects with well-defined properties. The theorem ensures that our finite-dimensional intuitions (sampling, conditioning, marginalization) extend coherently to the full infinite-dimensional setting.
Gaussian Processes are one family in a broader landscape of stochastic processes. Understanding what makes GPs special clarifies their strengths and limitations.
Why 'Gaussian'?
The choice of Gaussian distributions isn't arbitrary—it provides unique computational advantages:
Closure under conditioning: Conditioning a Gaussian on observed variables yields another Gaussian. This makes posterior inference exact and tractable.
Closure under marginalization: Marginalizing out variables from a Gaussian yields a Gaussian. We can ignore unobserved locations without approximation.
Closure under linear operations: Sum, difference, integral, derivative of Gaussians are Gaussian. This enables modeling of complex physical relationships.
Maximum entropy: Among distributions with specified mean and variance, the Gaussian has maximum entropy. It's the 'least informative' choice given first and second moments—a form of principled agnosticism.
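The maximum-entropy claim can be illustrated with a quick calculation (my own example, using the standard differential-entropy formulas): for a fixed variance, the Gaussian's entropy exceeds that of a uniform distribution with the same variance:

```python
import numpy as np

sigma2 = 2.0

# Differential entropy of N(0, sigma^2): 0.5 * log(2 * pi * e * sigma^2)
h_gauss = 0.5 * np.log(2 * np.pi * np.e * sigma2)

# Uniform on [-a, a] has variance a^2 / 3; match it: a = sqrt(3 * sigma^2)
a = np.sqrt(3 * sigma2)
h_unif = np.log(2 * a)  # entropy of Uniform(-a, a) is log(b - a)

print(h_gauss > h_unif)  # True: the Gaussian maximizes entropy for fixed variance
```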
Comparison with Other Processes:
| Process | Finite-Dim Distributions | Posterior Inference | Typical Use |
|---|---|---|---|
| Gaussian Process | Multivariate Gaussian | Exact (closed form) | Regression, optimization |
| Wiener Process | Gaussian (special case) | Exact | Brownian motion modeling |
| Poisson Process | Poisson | Various | Count data, event modeling |
| Cox Process | Poisson with random rate | Often intractable | Spatial point patterns |
| Student-t Process | Multivariate t | Exact but heavier tails | Robust regression |
| Dirichlet Process | Dirichlet | MCMC typically | Clustering, density estimation |
Key GP Advantages:

- Exact, closed-form posterior inference and marginal likelihood
- Calibrated uncertainty estimates alongside every prediction
- Flexible, interpretable priors through kernel choice and composition
- Strong performance in small-data regimes

Key GP Limitations:

- $O(n^3)$ training cost and $O(n^2)$ memory from the Gram matrix
- Exact inference requires a Gaussian likelihood; classification and count data need approximations
- Results are sensitive to the choice of kernel and its hyperparameters
GPs are ideal when: (1) you need uncertainty quantification, not just predictions; (2) data is limited and you can't afford overconfident extrapolation; (3) function properties like smoothness are known or can be learned; (4) the problem involves continuous inputs and outputs. For other scenarios, consider extensions (sparse GPs, non-Gaussian likelihoods) or alternative models.
We've now established the formal definition of Gaussian Processes and explored its mathematical implications. Here are the core takeaways:

- A GP is a collection of random variables, any finite number of which are jointly Gaussian.
- A GP is completely specified by a mean function $m$ and a covariance function (kernel) $k$: $f \sim \mathcal{GP}(m, k)$.
- Any finite set of inputs yields $\mathbf{f} \sim \mathcal{N}(\boldsymbol{\mu}, \mathbf{K})$, with $\boldsymbol{\mu}$ and $\mathbf{K}$ built by evaluating $m$ and $k$.
- Marginalization, conditioning, and linear operations preserve Gaussianity, which is what makes GP inference tractable.
- The kernel determines sample-path properties: stationarity, isotropy, and smoothness.
- The Kolmogorov extension theorem guarantees that any valid (positive semi-definite) kernel defines a unique GP.
With the GP definition firmly established, we're ready to explore the critical choice in any GP model: the mean and covariance functions. The next page dives deep into common kernels, their properties, and how to choose and combine them to encode your prior beliefs about the function you're trying to learn.
You now have rigorous understanding of the Gaussian Process definition and its mathematical properties. This foundational knowledge underlies all GP algorithms and applications. Proceed to learn about the mean and covariance functions that give GPs their expressive power.