Real-world data is noisy. Measurements have errors. Labels are imperfect. Observations are corrupted. Any practical regression method must gracefully handle this reality.
Gaussian Process regression incorporates observation noise in a principled, probabilistic way. Rather than treating noise as a nuisance to be eliminated, the GP framework treats it as a fundamental aspect of the observation model. This leads to smoothing rather than interpolation: the GP no longer passes exactly through training points but instead finds a smooth function that balances fidelity to data against prior smoothness assumptions.
The noise model has profound implications. It affects prediction accuracy, uncertainty calibration, and computational stability. It determines whether we trust individual observations or smooth over them. It enables separation of irreducible aleatoric uncertainty from reducible epistemic uncertainty.
In this page, we develop a complete understanding of noise handling in GP regression—from the basic homoscedastic model to sophisticated heteroscedastic extensions.
By the end of this page, you will: (1) Understand how observation noise modifies the GP posterior, (2) Distinguish homoscedastic from heteroscedastic noise, (3) Estimate noise variance from data, (4) Recognize the interpolation-smoothing tradeoff, (5) Implement robust noise handling in practice.
The standard GP regression observation model assumes:
$$y_i = f(\mathbf{x}_i) + \epsilon_i, \quad \epsilon_i \sim \mathcal{N}(0, \sigma_n^2)$$
where $y_i$ is the observed target, $f(\mathbf{x}_i)$ is the latent function value at input $\mathbf{x}_i$, and $\epsilon_i$ is the noise term.
This model says each observation is the true function value plus Gaussian noise. The noise terms are independent and identically distributed across observations, with zero mean and shared variance $\sigma_n^2$.
Prior on Observations
Combining the GP prior $f \sim \mathcal{GP}(0, k)$ with the noise model, the observations follow:
$$\mathbf{y} | \mathbf{X} \sim \mathcal{N}(\mathbf{0}, K + \sigma_n^2 I)$$
The covariance matrix becomes $K + \sigma_n^2 I$: we add the noise variance to the diagonal. This is sometimes called the 'noisy' kernel:
$$k_y(\mathbf{x}, \mathbf{x}') = k(\mathbf{x}, \mathbf{x}') + \sigma_n^2 \delta_{\mathbf{x}, \mathbf{x}'}$$
where $\delta$ is the Kronecker delta (1 if $\mathbf{x} = \mathbf{x}'$, 0 otherwise).
Noise adds to the diagonal only because the noise at different observations is independent. The off-diagonal entries of K capture function correlation; noise doesn't add correlation between different observations.
Two Covariance Matrices
It's important to distinguish:
| Symbol | Formula | Meaning |
|---|---|---|
| $K$ | $k(\mathbf{X}, \mathbf{X})$ | Covariance of latent function $f$ |
| $K_y$ | $K + \sigma_n^2 I$ | Covariance of noisy observations $y$ |
For prediction: invert $K_y$ when computing the posterior weights, since the weights must account for observation noise, but use the noise-free kernel when reasoning about the latent function. The predictive variance of $f_*$ involves only kernel terms; the predictive variance of a new observation $y_*$ additionally includes $\sigma_n^2$.
This distinction is crucial for uncertainty interpretation.
With observation noise, the posterior distribution of the latent function at test points becomes:
$$\mathbf{f}_* | \mathbf{X}_*, \mathbf{X}, \mathbf{y} \sim \mathcal{N}(\bar{\boldsymbol{\mu}}_*, \bar{\Sigma}_*)$$
where:
$$\bar{\boldsymbol{\mu}}_* = K(\mathbf{X}_*, \mathbf{X}) [K(\mathbf{X}, \mathbf{X}) + \sigma_n^2 I]^{-1} \mathbf{y}$$
$$\bar{\Sigma}_* = K(\mathbf{X}_*, \mathbf{X}_*) - K(\mathbf{X}_*, \mathbf{X}) [K(\mathbf{X}, \mathbf{X}) + \sigma_n^2 I]^{-1} K(\mathbf{X}, \mathbf{X}_*)$$
Comparing to the noise-free case, the only change is the regularizing term $\sigma_n^2 I$ added to the training covariance matrix before inversion.
Adding σ_n²I to the diagonal makes K_y = K + σ_n²I strictly positive definite even when K itself is ill-conditioned. This is why noise addition is numerically stabilizing—it's effectively Tikhonov regularization on the matrix inversion.
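This stabilizing effect is easy to check numerically. The sketch below (illustrative values, not from the text) builds an RBF kernel matrix over closely spaced points, which is nearly singular, and shows how adding $\sigma_n^2$ to the diagonal bounds the condition number:

```python
import numpy as np

# An RBF kernel over 50 closely spaced points is numerically near-singular;
# adding sigma_n^2 to the diagonal bounds the smallest eigenvalue from below.
X = np.linspace(0, 1, 50)
K = np.exp(-0.5 * (X[:, None] - X[None, :])**2 / 0.5**2)  # RBF kernel, l = 0.5

sigma_n2 = 0.01
K_y = K + sigma_n2 * np.eye(50)

print(f"cond(K)   = {np.linalg.cond(K):.3e}")    # astronomically large
print(f"cond(K_y) = {np.linalg.cond(K_y):.3e}")  # roughly (λ_max + σ_n²)/σ_n²
```

The condition number drops from near machine-precision limits to a few thousand, which is exactly the Tikhonov effect described above.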
Key Differences from Noise-Free Case
No exact interpolation: The posterior mean no longer passes exactly through training points. There's a 'smoothing' effect.
Non-zero variance at training points: Even at $\mathbf{x}_i$, the posterior variance is positive (though small). We're uncertain about the true function value because the observation was noisy.
Decreased influence of individual points: With high noise, individual observations have less influence on predictions. The GP averages over nearby noisy observations.
Posterior Variance at Training Points
At a training point $\mathbf{x}_i$, the posterior variance of the latent function $f$ is:
$$\text{Var}(f(\mathbf{x}_i) | \mathcal{D}) > 0$$
This reflects that we observed a noisy version of $f(\mathbf{x}_i)$, not $f(\mathbf{x}_i)$ itself. As $\sigma_n^2 \to 0$, this variance approaches zero and we recover interpolation.
Predictive Distribution for Future Observations
If we want to predict a new noisy observation $y_*$ rather than the latent function value $f_*$:
$$y_* | \mathbf{X}_*, \mathbf{X}, \mathbf{y} \sim \mathcal{N}(\bar{\mu}_*, \bar{\sigma}_*^2 + \sigma_n^2)$$
The predictive variance includes both components: the epistemic uncertainty $\bar{\sigma}_*^2$ about the latent function, and the aleatoric noise variance $\sigma_n^2$ of the new observation itself.
```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def gp_predict(X_train, y_train, X_test, kernel_fn, noise_var=0.0, jitter=1e-8):
    """GP prediction with configurable noise variance."""
    n = X_train.shape[0]
    K = kernel_fn(X_train, X_train)
    K_y = K + (noise_var + jitter) * np.eye(n)
    K_star = kernel_fn(X_train, X_test)
    K_ss = kernel_fn(X_test, X_test)

    L = cholesky(K_y, lower=True)
    alpha = solve_triangular(L.T, solve_triangular(L, y_train, lower=True), lower=False)
    mu = K_star.T @ alpha

    V = solve_triangular(L, K_star, lower=True)
    var_f = np.diag(K_ss) - np.sum(V**2, axis=0)  # Latent function variance
    var_y = var_f + noise_var                     # Noisy observation variance
    return mu, var_f, var_y

# Demonstrate the effect of noise
np.random.seed(42)

def rbf(X1, X2, l=1.0, s=1.0):
    sq_dist = np.sum(X1**2, 1, keepdims=True) - 2*X1@X2.T + np.sum(X2**2, 1)
    return s * np.exp(-0.5 * sq_dist / l**2)

# Training data with actual noise
X_train = np.linspace(-3, 3, 10).reshape(-1, 1)
f_true = np.sin(X_train.flatten())
y_train = f_true + 0.3 * np.random.randn(10)  # Noisy observations

X_test = np.linspace(-4, 4, 100).reshape(-1, 1)

print("Effect of noise variance on predictions at training points:")
print("-" * 60)

for noise_var in [0.0, 0.01, 0.09, 0.25]:
    mu, var_f, var_y = gp_predict(X_train, y_train, X_train, rbf, noise_var=noise_var)
    # Residual at training points
    residual = np.mean(np.abs(mu - y_train))
    mean_var = np.mean(var_f)
    print(f"σ_n² = {noise_var:.2f}: Mean |residual| = {residual:.4f}, "
          f"Mean Var(f) at train pts = {mean_var:.4f}")

print("Note: Higher noise → larger residuals (smoothing) and higher variance at train pts")

# Compare variance types at a test point
mu, var_f, var_y = gp_predict(X_train, y_train, np.array([[0.5]]), rbf, noise_var=0.09)
print("At x=0.5 with σ_n²=0.09:")
print(f"  Var(f*) = {var_f[0]:.4f} (epistemic uncertainty)")
print(f"  Var(y*) = {var_y[0]:.4f} (epistemic + aleatoric = {var_f[0]:.4f} + 0.09)")
```

The noise parameter $\sigma_n^2$ controls a fundamental tradeoff:
Interpolation ($\sigma_n^2 = 0$): The GP passes exactly through all training points. This is appropriate when observations are exact (e.g., function evaluations in simulation).
Smoothing ($\sigma_n^2 > 0$): The GP balances fitting the data against maintaining smoothness. Higher noise means more smoothing.
This tradeoff can be understood through the lens of regularization or Bayesian model selection.
| Aspect | Interpolation (σ_n²=0) | Smoothing (σ_n²>0) |
|---|---|---|
| Passes through data | Exactly | Approximately |
| Posterior variance at data | Zero | Positive |
| Response to outliers | Captures all noise | Smooths over outliers |
| Function sample behavior | Constrained at data points | Loose at data points |
| Appropriate for | Noise-free simulations | Noisy measurements |
Forcing interpolation (σ_n²=0) on noisy data leads to overfitting: predictions oscillate wildly between data points to hit each one exactly. The resulting function may have high posterior uncertainty between points even though it has zero uncertainty at the points themselves.
Visualizing the Tradeoff
Imagine training points that are noisy measurements of a smooth underlying function:
With $\sigma_n^2 = 0$, the GP forces itself through each noisy point, creating wiggles that aren't present in the true function.
With appropriate $\sigma_n^2$, the GP finds a smooth compromise that doesn't hit any point exactly but better captures the underlying trend.
With very large $\sigma_n^2$, the GP over-smooths, essentially reverting to the prior mean and ignoring the data.
Mathematical Perspective
The smoothing effect can be understood through the limiting cases:
$$\lim_{\sigma_n^2 \to 0} K_y^{-1} = K^{-1}$$ (interpolation)
$$K_y^{-1} \approx \frac{1}{\sigma_n^2} I \quad \text{as } \sigma_n^2 \to \infty$$ (correlations are swamped by noise; the posterior mean shrinks toward the prior)
The optimal $\sigma_n^2$ balances these extremes based on the actual noise level in the data.
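Both limits can be checked numerically. The sketch below (illustrative kernel and data, not from the text) evaluates the posterior mean at the training inputs, $K (K + \sigma_n^2 I)^{-1} \mathbf{y}$, for a range of noise levels:

```python
import numpy as np

# Posterior mean at the training inputs for a range of noise variances.
np.random.seed(0)
X = np.linspace(-3, 3, 15)
K = np.exp(-0.5 * (X[:, None] - X[None, :])**2 / 0.3**2)  # RBF, short length scale
y = np.sin(X) + 0.1 * np.random.randn(15)

for sigma_n2 in [1e-8, 1e-2, 1e2, 1e6]:
    mean_at_train = K @ np.linalg.solve(K + sigma_n2 * np.eye(15), y)
    fit_err = np.max(np.abs(mean_at_train - y))
    print(f"σ_n² = {sigma_n2:>8.0e}: max |mean - y| at train pts = {fit_err:.4f}")
# σ_n² → 0 : the mean reproduces y exactly (interpolation)
# σ_n² → ∞ : the mean shrinks toward the prior mean 0 (data ignored)
```

Between the extremes, intermediate noise levels produce the partial smoothing discussed above.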
In practice, the noise variance $\sigma_n^2$ is rarely known a priori. It must be estimated from the data, typically alongside other kernel hyperparameters.
Maximum Marginal Likelihood
The standard approach optimizes the log marginal likelihood (evidence):
$$\log p(\mathbf{y} | \mathbf{X}, \boldsymbol{\theta}) = -\frac{1}{2} \mathbf{y}^\top K_y^{-1} \mathbf{y} - \frac{1}{2} \log |K_y| - \frac{n}{2} \log(2\pi)$$
where $\boldsymbol{\theta}$ includes all hyperparameters: kernel parameters $\{\sigma_f^2, \ell, \ldots\}$ and the noise variance $\sigma_n^2$.
The gradient with respect to $\sigma_n^2$ is:
$$\frac{\partial \log p(\mathbf{y})}{\partial \sigma_n^2} = \frac{1}{2} \text{tr}\left( (\boldsymbol{\alpha} \boldsymbol{\alpha}^\top - K_y^{-1}) I \right) = \frac{1}{2} \left( \|\boldsymbol{\alpha}\|^2 - \text{tr}(K_y^{-1}) \right)$$
where $\boldsymbol{\alpha} = K_y^{-1} \mathbf{y}$.
The gradient balances two terms: ||α||² (data fit term, wants smaller noise to fit better) vs tr(K_y⁻¹) (complexity penalty, wants larger noise for simpler model). The marginal likelihood automatically balances data fidelity against model complexity.
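The analytic gradient can be sanity-checked against a finite-difference approximation of the log marginal likelihood. This sketch uses an illustrative kernel and dataset (not from the text):

```python
import numpy as np

# Finite-difference check of d(log marginal likelihood)/d(sigma_n^2).
np.random.seed(1)
n = 12
X = np.sort(np.random.uniform(-2, 2, n))
K = np.exp(-0.5 * (X[:, None] - X[None, :])**2)   # RBF kernel, l = 1
y = np.sin(X) + 0.2 * np.random.randn(n)

def log_ml(sigma_n2):
    K_y = K + sigma_n2 * np.eye(n)
    _, logdet = np.linalg.slogdet(K_y)
    return -0.5 * y @ np.linalg.solve(K_y, y) - 0.5 * logdet - 0.5 * n * np.log(2 * np.pi)

sigma_n2 = 0.1
K_y = K + sigma_n2 * np.eye(n)
alpha = np.linalg.solve(K_y, y)
grad_analytic = 0.5 * (alpha @ alpha - np.trace(np.linalg.inv(K_y)))

h = 1e-6
grad_numeric = (log_ml(sigma_n2 + h) - log_ml(sigma_n2 - h)) / (2 * h)
print(f"analytic: {grad_analytic:.6f}, numeric: {grad_numeric:.6f}")
```

The two values agree to several decimal places, confirming that $\partial K_y / \partial \sigma_n^2 = I$ in the general gradient formula.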
Practical Considerations for Noise Estimation
Bounds: Noise variance should be positive. Optimize $\log \sigma_n^2$ or use constrained optimization.
Identifiability: With too few data points, noise and signal variance may be confounded. You can't distinguish between 'large noise on a smooth function' and 'small noise on a wiggly function'.
Local optima: The marginal likelihood is non-convex. Initialize with a sensible estimate (e.g., based on residual variance from a simple model) and try multiple restarts.
Priors on noise: For robust estimation, place a prior on $\sigma_n^2$ (e.g., log-normal) and maximize the penalized likelihood.
Cross-Validation Alternative
Instead of marginal likelihood, use leave-one-out cross-validation:
$$\text{LOO-CV} = \sum_{i=1}^n \log p(y_i | \mathbf{y}_{-i}, \boldsymbol{\theta})$$
This can be computed efficiently using the 'virtual leave-one-out' formula without refitting the GP $n$ times.
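A minimal sketch of the virtual leave-one-out computation (illustrative data; the formulas are the standard closed-form LOO identities for Gaussian models): with $K_y^{-1}$ in hand, each held-out predictive mean and variance follows from $\mu_i = y_i - \alpha_i / [K_y^{-1}]_{ii}$ and $\sigma_i^2 = 1 / [K_y^{-1}]_{ii}$, with no refitting.

```python
import numpy as np

# Virtual leave-one-out predictions from a single matrix inverse.
np.random.seed(2)
n = 20
X = np.sort(np.random.uniform(-3, 3, n))
K = np.exp(-0.5 * (X[:, None] - X[None, :])**2)   # RBF kernel, l = 1
y = np.sin(X) + 0.2 * np.random.randn(n)

K_y_inv = np.linalg.inv(K + 0.04 * np.eye(n))      # sigma_n^2 = 0.04
alpha = K_y_inv @ y
d = np.diag(K_y_inv)

loo_mean = y - alpha / d      # μ_i = y_i - α_i / [K_y⁻¹]_ii
loo_var = 1.0 / d             # σ_i² = 1 / [K_y⁻¹]_ii (includes noise)
loo_log_prob = np.sum(-0.5 * np.log(2 * np.pi * loo_var)
                      - (y - loo_mean)**2 / (2 * loo_var))
print(f"LOO-CV log predictive probability: {loo_log_prob:.3f}")
```

These values match what explicit refitting with each point held out would produce, at a fraction of the cost.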
```python
import numpy as np
from scipy.linalg import cholesky, cho_solve
from scipy.optimize import minimize

def log_marginal_likelihood(theta, X, y, kernel_fn):
    """
    Compute negative log marginal likelihood for optimization.
    theta = [log_length_scale, log_signal_var, log_noise_var]
    """
    l = np.exp(theta[0])
    s2 = np.exp(theta[1])
    noise_var = np.exp(theta[2])
    n = len(y)

    # Build kernel matrix
    K = kernel_fn(X, X, l=l, s=s2)
    K_y = K + noise_var * np.eye(n) + 1e-8 * np.eye(n)

    try:
        L = cholesky(K_y, lower=True)
    except np.linalg.LinAlgError:
        return np.inf  # Not positive definite

    # alpha = K_y^{-1} @ y
    alpha = cho_solve((L, True), y)

    # Log marginal likelihood
    log_det = 2 * np.sum(np.log(np.diag(L)))
    data_fit = -0.5 * y @ alpha
    complexity = -0.5 * log_det
    const = -0.5 * n * np.log(2 * np.pi)
    lml = data_fit + complexity + const
    return -lml  # Negative for minimization

def estimate_hyperparameters(X, y, kernel_fn, n_restarts=5):
    """
    Estimate kernel hyperparameters including noise variance
    by maximizing marginal likelihood.
    """
    best_nlml = np.inf
    best_theta = None
    for _ in range(n_restarts):
        # Random initialization in log-space
        theta0 = np.random.randn(3)
        result = minimize(
            log_marginal_likelihood, theta0, args=(X, y, kernel_fn),
            method='L-BFGS-B',
            bounds=[(-5, 5), (-5, 5), (-10, 2)]  # Reasonable bounds
        )
        if result.fun < best_nlml:
            best_nlml = result.fun
            best_theta = result.x
    return {
        'length_scale': np.exp(best_theta[0]),
        'signal_variance': np.exp(best_theta[1]),
        'noise_variance': np.exp(best_theta[2]),
        'neg_log_marginal_likelihood': best_nlml
    }

# Example: Estimate noise on synthetic data
np.random.seed(42)

def rbf(X1, X2, l=1.0, s=1.0):
    sq_dist = np.sum(X1**2, 1, keepdims=True) - 2*X1@X2.T + np.sum(X2**2, 1)
    return s * np.exp(-0.5 * sq_dist / l**2)

# True parameters
true_l = 1.0
true_s2 = 1.0
true_noise = 0.2

# Generate data
X = np.random.uniform(-3, 3, 30).reshape(-1, 1)
f_true = np.sin(X.flatten())
y = f_true + np.sqrt(true_noise) * np.random.randn(30)

print(f"True hyperparameters: l={true_l}, σ_f²={true_s2}, σ_n²={true_noise}")
print()

# Estimate
result = estimate_hyperparameters(X, y, rbf, n_restarts=3)

print("Estimated hyperparameters:")
print(f"  Length scale:    {result['length_scale']:.3f} (true: {true_l})")
print(f"  Signal variance: {result['signal_variance']:.3f} (true: {true_s2})")
print(f"  Noise variance:  {result['noise_variance']:.3f} (true: {true_noise})")
```

The standard GP model assumes homoscedastic noise: constant variance $\sigma_n^2$ everywhere. In many real-world scenarios, noise varies with input: sensor precision may degrade at the edges of the measurement range, experimental conditions may drift, and some regimes are simply harder to measure than others.
This is heteroscedastic noise: $\text{Var}(\epsilon_i) = \sigma_n^2(\mathbf{x}_i)$, a function of input.
Modeling Heteroscedastic Noise
Several approaches exist:
1. Known Noise Function: If $\sigma_n^2(\mathbf{x})$ is known (e.g., from instrument specifications):
$$y_i = f(\mathbf{x}_i) + \epsilon_i, \quad \epsilon_i \sim \mathcal{N}(0, \sigma_n^2(\mathbf{x}_i))$$
The covariance becomes $K + \text{diag}([\sigma_n^2(\mathbf{x}_1), \ldots, \sigma_n^2(\mathbf{x}_n)])$.
2. Learned Noise Function: Model $\log \sigma_n^2(\mathbf{x})$ with another GP:
$$g(\mathbf{x}) = \log \sigma_n^2(\mathbf{x}), \quad g \sim \mathcal{GP}$$
This requires approximate inference (variational or MCMC) since the joint model is non-Gaussian.
We model log σ_n²(x) rather than σ_n²(x) directly because: (1) variance must be positive—log transform ensures this, (2) variance often varies over orders of magnitude—log scale is natural, (3) the resulting noise is log-normal distributed, a common real-world pattern.
3. Most Likely Heteroscedastic GP (MLHGP):
A practical approach (Kersting et al.) iterates: (1) fit a standard homoscedastic GP to the data; (2) compute empirical noise estimates from the squared residuals; (3) fit a second GP to the log of these residuals to obtain a smooth noise function; (4) refit the main GP using the predicted input-dependent noise, and repeat until convergence.
4. Stochastic Variational Heteroscedastic GP:
Place a GP prior on both the mean function and the log noise: $$f \sim \mathcal{GP}(0, k_f), \quad g = \log \sigma_n^2 \sim \mathcal{GP}(\mu_g, k_g)$$
Use variational inference with inducing points for scalability.
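The MLHGP-style loop (approach 3 above) can be sketched compactly. Everything below is an illustrative sketch, not a reference implementation: `gp_posterior_mean`, the kernel settings, and the noise floor are all assumed choices, and the raw log-residual smoothing is known to be biased low.

```python
import numpy as np

# Minimal MLHGP-style iteration: alternate between fitting the mean GP
# and smoothing the log squared residuals to estimate a noise function.
def gp_posterior_mean(X, y, X_star, noise_diag, l=1.0):
    K = np.exp(-0.5 * (X[:, None] - X[None, :])**2 / l**2)
    K_s = np.exp(-0.5 * (X[:, None] - X_star[None, :])**2 / l**2)
    return K_s.T @ np.linalg.solve(K + np.diag(noise_diag), y)

np.random.seed(3)
n = 40
X = np.sort(np.random.uniform(-3, 3, n))
true_noise_std = 0.1 + 0.2 * np.abs(X)       # noise grows away from the origin
y = np.sin(X) + true_noise_std * np.random.randn(n)

noise_var = np.full(n, 0.1)                  # crude homoscedastic start
for _ in range(5):
    # 1. Fit the mean GP with the current noise estimates
    f_hat = gp_posterior_mean(X, y, X, noise_var)
    # 2. Empirical noise targets from squared residuals
    z = np.log((y - f_hat)**2 + 1e-8)
    # 3. Smooth the centered log-residuals with a second GP
    g_hat = gp_posterior_mean(X, z - z.mean(), X, np.full(n, 1.0), l=1.5) + z.mean()
    # 4. Update; a floor guards against the well-known shrinkage feedback
    noise_var = np.clip(np.exp(g_hat), 1e-4, None)

print("Estimated vs true noise std (log-residual estimate is biased low):")
for idx in [0, n // 2, n - 1]:
    print(f"  x={X[idx]:+.2f}: est {np.sqrt(noise_var[idx]):.3f}, "
          f"true {true_noise_std[idx]:.3f}")
```

Even this crude version recovers the qualitative pattern: larger estimated noise where the data are genuinely noisier.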
Impact of Heteroscedasticity
Ignoring heteroscedasticity can lead to overconfident predictions in high-noise regions, underconfident predictions in low-noise regions, and a posterior mean that overweights noisy observations while underweighting precise ones.
```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def gp_predict_heteroscedastic(X_train, y_train, X_test, kernel_fn,
                               noise_variances, jitter=1e-8):
    """
    GP prediction with known heteroscedastic noise.

    Parameters:
        noise_variances: Array of noise variance at each training point (n,)
    """
    n = X_train.shape[0]
    K = kernel_fn(X_train, X_train)
    K_y = K + np.diag(noise_variances) + jitter * np.eye(n)
    K_star = kernel_fn(X_train, X_test)
    K_ss = kernel_fn(X_test, X_test)

    L = cholesky(K_y, lower=True)
    alpha = solve_triangular(L.T, solve_triangular(L, y_train, lower=True), lower=False)
    mu = K_star.T @ alpha
    V = solve_triangular(L, K_star, lower=True)
    var_f = np.diag(K_ss) - np.sum(V**2, axis=0)
    return mu, var_f

# Example: Heteroscedastic data
np.random.seed(42)

def rbf(X1, X2, l=1.0, s=1.0):
    sq_dist = np.sum(X1**2, 1, keepdims=True) - 2*X1@X2.T + np.sum(X2**2, 1)
    return s * np.exp(-0.5 * sq_dist / l**2)

# Training data with input-dependent noise
X_train = np.linspace(-3, 3, 20).reshape(-1, 1)
f_true = np.sin(X_train.flatten())

# Noise increases with |x|
true_noise_std = 0.1 + 0.2 * np.abs(X_train.flatten())
true_noise_var = true_noise_std**2
y_train = f_true + true_noise_std * np.random.randn(20)

X_test = np.linspace(-3.5, 3.5, 100).reshape(-1, 1)

# Compare homoscedastic vs heteroscedastic
# Homoscedastic: use mean noise variance
mean_noise = np.mean(true_noise_var)

def gp_predict_homo(X_train, y_train, X_test, kernel_fn, noise_var):
    n = X_train.shape[0]
    K = kernel_fn(X_train, X_train)
    K_y = K + noise_var * np.eye(n)
    K_star = kernel_fn(X_train, X_test)
    K_ss = kernel_fn(X_test, X_test)
    L = cholesky(K_y + 1e-8 * np.eye(n), lower=True)
    alpha = solve_triangular(L.T, solve_triangular(L, y_train, lower=True), lower=False)
    mu = K_star.T @ alpha
    V = solve_triangular(L, K_star, lower=True)
    var_f = np.diag(K_ss) - np.sum(V**2, axis=0)
    return mu, var_f

mu_homo, var_homo = gp_predict_homo(X_train, y_train, X_test, rbf, mean_noise)
mu_hetero, var_hetero = gp_predict_heteroscedastic(X_train, y_train, X_test, rbf,
                                                   true_noise_var)

print("Comparison at x=0 (low noise region) and x=3 (high noise region):")
idx_0 = 50  # x ≈ 0
idx_3 = 92  # x ≈ 3

for name, mu, var in [('Homoscedastic', mu_homo, var_homo),
                      ('Heteroscedastic', mu_hetero, var_hetero)]:
    print(f"{name}:")
    print(f"  At x≈0: μ={mu[idx_0]:.3f}, σ={np.sqrt(var[idx_0]):.3f}")
    print(f"  At x≈3: μ={mu[idx_3]:.3f}, σ={np.sqrt(var[idx_3]):.3f}")
```

Two related concepts—the nugget effect and jitter—both involve adding small values to the kernel diagonal, but for different reasons.
Nugget Effect (Geostatistics Term)
In geostatistics, the 'nugget' represents discontinuous variation at zero distance. It accounts for micro-scale variation below the sampling resolution and for measurement error.
Mathematically, the nugget modifies the kernel: $$k_{\text{nugget}}(\mathbf{x}, \mathbf{x}') = k(\mathbf{x}, \mathbf{x}') + \sigma_{\text{nugget}}^2 \cdot \delta(\mathbf{x}, \mathbf{x}')$$
This is equivalent to our observation noise $\sigma_n^2$. The nugget is a physically meaningful quantity.
Numerical Jitter
Jitter is a small value (typically $10^{-6}$ to $10^{-10}$) added to the diagonal for numerical stability: $$K_{\text{stable}} = K + \epsilon_{\text{jitter}} I$$
Jitter prevents Cholesky decomposition failures when training points are nearly duplicated, when the length scale is large relative to the point spacing, or more generally when the kernel matrix is nearly rank-deficient.
Jitter should be small enough to not materially affect predictions—it's a numerical hack, not a modeling choice. If you need jitter > 10⁻⁴ for stability, your kernel is likely ill-specified. True noise should be modeled explicitly via σ_n². Conflating jitter and noise leads to incorrect uncertainty estimates.
Choosing Jitter
| Situation | Recommended Jitter |
|---|---|
| Well-conditioned kernel | $10^{-10}$ or less |
| Moderate conditioning issues | $10^{-8}$ to $10^{-6}$ |
| Difficult problems | $10^{-6}$ to $10^{-4}$ |
| Severe singularity | Diagnose the problem first! |
Diagnosing the Need for Large Jitter
If you need large jitter, investigate:
Near-duplicate training points: Points with $|\mathbf{x}_i - \mathbf{x}_j| \approx 0$ create near-singularity. Solution: deduplicate or average observations.
Length scale too small: When $\ell \ll$ typical point spacing, the kernel matrix approaches identity but may have numerical issues. Solution: increase $\ell$.
Length scale too large: When $\ell \gg$ domain size, all points look identical, creating rank deficiency. Solution: decrease $\ell$.
Mixed scales in features: Features with very different magnitudes create conditioning problems. Solution: standardize inputs.
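The first failure mode is easy to reproduce. This sketch (illustrative values) shows a near-duplicate input driving the kernel matrix toward singularity, and deduplication restoring a modest condition number:

```python
import numpy as np

# A near-duplicate input makes the RBF kernel matrix nearly singular.
X = np.array([0.0, 0.5, 0.5 + 1e-6, 1.0])     # near-duplicate at x = 0.5
K = np.exp(-0.5 * (X[:, None] - X[None, :])**2)
print(f"cond(K) with near-duplicate: {np.linalg.cond(K):.2e}")   # enormous

X_fixed = np.array([0.0, 0.5, 1.0])           # deduplicated inputs
K_fixed = np.exp(-0.5 * (X_fixed[:, None] - X_fixed[None, :])**2)
print(f"cond(K) after deduplication: {np.linalg.cond(K_fixed):.2e}")  # modest
```

In practice one would also average the observations attached to the merged inputs, or model them with explicit observation noise.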
When Noise and Jitter Combine
In practice, the effective diagonal addition is: $$K_y = K + \sigma_n^2 I + \epsilon_{\text{jitter}} I = K + (\sigma_n^2 + \epsilon_{\text{jitter}}) I$$
For noisy data with $\sigma_n^2 > 10^{-4}$, the jitter is negligible. For noise-free data, jitter is essential for stability.
The Gaussian noise assumption $\epsilon \sim \mathcal{N}(0, \sigma_n^2)$ is convenient but not always realistic. Real data often has outliers, heavy-tailed error distributions, or asymmetric errors.
Gaussian noise is not robust to outliers. A single outlier can dramatically distort the posterior mean because the GP tries hard to accommodate it.
Alternative Noise Models
1. Student-t Likelihood: $$p(y | f) = \mathcal{T}(y | f, \sigma, \nu)$$
with degrees of freedom $\nu$ controlling tail heaviness. As $\nu \to \infty$, this approaches a Gaussian. Smaller $\nu$ (e.g., 3–5) accommodates outliers.
2. Laplace (Double Exponential) Likelihood: $$p(y | f) \propto \exp(-|y - f| / b)$$
Corresponds to L1 loss, inherently more robust.
3. Huber Likelihood: Quadratic near the mode, linear in the tails—a smooth transition between L2 and L1.
With non-Gaussian likelihoods, the posterior is no longer Gaussian and doesn't have a closed form. You need approximate inference: Laplace approximation, Expectation Propagation, or MCMC sampling. This increases computational cost but provides robustness.
Practical Approaches Without Non-Gaussian Likelihoods
If you want to stay with Gaussian inference for simplicity:
Outlier Detection and Removal: Identify outliers by leave-one-out prediction error, remove or downweight them.
Winsorization: Cap extreme observations at percentile limits.
Incremental Noise Variance: Give suspected outliers larger individual noise variances (heteroscedastic approach).
Mixture Noise Model: Model noise as a mixture of Gaussian (inliers) and uniform (outliers): $$p(\epsilon) = (1 - \pi) \mathcal{N}(0, \sigma_n^2) + \pi \mathcal{U}(-B, B)$$
Effect of Outliers
A single outlier with Gaussian likelihood pulls the posterior mean strongly toward itself, distorts predictions at nearby inputs, and inflates the estimated noise variance for the entire dataset.
This is why outlier handling is important before GP fitting, or robust likelihoods should be used.
We have developed a comprehensive understanding of noise in Gaussian Process regression. To consolidate: observation noise enters the model as $\sigma_n^2 I$ added to the diagonal of the kernel matrix; it turns interpolation into smoothing and leaves positive posterior variance at the training points; the noise variance is typically estimated jointly with the kernel hyperparameters by maximizing the marginal likelihood; heteroscedastic extensions allow input-dependent noise; jitter is a numerical device distinct from modeled noise; and non-Gaussian likelihoods provide robustness to outliers at the cost of approximate inference.
What's Next
We have covered posterior derivation, predictive distributions, mean/variance prediction, and noise handling. In the final page of this module, we bring everything together with the closed-form solution—a complete, implementable algorithm for GP regression that you can apply to real problems.
You now understand how Gaussian Processes handle observation noise—from the basic homoscedastic model to heteroscedastic extensions and robust alternatives. This knowledge is essential for applying GPs to real-world data where perfect measurements are rare.