Most machine learning models produce point predictions: a single "best guess" for each input. Gaussian Processes go further—they provide complete predictive distributions that quantify uncertainty at every point.
This uncertainty-aware prediction is not a luxury; it's essential for real-world applications. When a GP predicts temperature will be 25°C with standard deviation of 0.5°C, that's fundamentally different from predicting 25°C with standard deviation of 10°C. The mean is the same, but the implications for decision-making are vastly different.
This page explores how to interpret, visualize, and leverage the full predictive distribution from GP regression. We'll move beyond seeing uncertainty as a curiosity to understanding it as actionable information.
By the end of this page, you will understand how to interpret the posterior mean as a prediction and the posterior variance as uncertainty, how to construct and visualize confidence/credible intervals, the distinction between epistemic and observation uncertainty, and how to use predictive distributions for decision-making and active learning.
The posterior mean $\mu_*(x)$ serves as our "best estimate" of the function value at input $x$. But what exactly does "best" mean?
Optimal Under Squared Error Loss:
The posterior mean minimizes expected squared error: $$\mu_* = \arg\min_a \mathbb{E}[(f(x_*) - a)^2 | \mathcal{D}]$$
If we must commit to a single prediction and will be penalized by squared error, the mean is optimal.
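This optimality is easy to check numerically. The sketch below (values are illustrative, not from the text) draws samples from a hypothetical Gaussian posterior at a single test point and scans candidate point predictions: the expected squared error is minimized at the posterior mean.

```python
import numpy as np

# Hypothetical posterior at one test point: f(x*) | D ~ N(0.7, 0.3^2)
mu, sigma = 0.7, 0.3
rng = np.random.default_rng(0)
f_samples = rng.normal(mu, sigma, 100_000)  # draws from the posterior

# Expected squared error (Monte Carlo) for a grid of candidate predictions a
candidates = np.linspace(-1, 2, 601)
risks = [np.mean((f_samples - a) ** 2) for a in candidates]
best = candidates[np.argmin(risks)]

print(f"argmin of expected squared error: {best:.2f} (posterior mean is {mu})")
```

The minimizer lands on the posterior mean up to Monte Carlo and grid resolution; under absolute-error loss the minimizer would instead be the posterior median (identical here, since the posterior is Gaussian).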
As a Weighted Combination:
We showed earlier that: $$\mu_* = m(x_*) + \mathbf{k}_*^T (K + \sigma_n^2 I)^{-1}(\mathbf{y} - \mathbf{m})$$
This reveals the structure: the prediction starts from the prior mean and adds a correction based on the discrepancy between observations and prior expectations, weighted by correlations.
Key Properties of the Posterior Mean:
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.linalg import cholesky, solve_triangular

def gp_predict(X_train, y_train, X_test, length_scale=1.0, signal_var=1.0, noise_var=0.1):
    """Standard GP prediction returning mean and std."""
    def kernel(X1, X2):
        X1, X2 = np.atleast_2d(X1), np.atleast_2d(X2)
        sq_dist = np.sum(X1**2, 1).reshape(-1, 1) + np.sum(X2**2, 1) - 2 * X1 @ X2.T
        return signal_var * np.exp(-np.maximum(sq_dist, 0) / (2 * length_scale**2))

    K = kernel(X_train, X_train) + noise_var * np.eye(len(X_train))
    K_star = kernel(X_test, X_train)
    L = cholesky(K, lower=True)
    alpha = solve_triangular(L.T, solve_triangular(L, y_train, lower=True))
    v = solve_triangular(L, K_star.T, lower=True)
    mean = K_star @ alpha
    var = signal_var - np.sum(v**2, axis=0)
    return mean, np.sqrt(np.maximum(var, 0))

np.random.seed(42)

# Demonstration 1: Mean as optimal estimate
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Generate data
X_train = np.array([[-2], [0], [1], [3]])
y_train = np.array([0.5, 0, 1, 0.5])
X_test = np.linspace(-4, 5, 200).reshape(-1, 1)

mean, std = gp_predict(X_train, y_train, X_test, length_scale=1.0, noise_var=0.01)

# Plot: Mean as central prediction
ax = axes[0, 0]
ax.fill_between(X_test.ravel(), mean - 2*std, mean + 2*std, alpha=0.2, color='C0')
ax.plot(X_test, mean, 'C0-', linewidth=2, label='Posterior mean')
ax.scatter(X_train, y_train, c='red', s=80, zorder=5, label='Data')
ax.axhline(0, color='gray', linestyle='--', alpha=0.3)
ax.set_title('Posterior Mean: Best Estimate Under L2 Loss')
ax.set_xlabel('x')
ax.set_ylabel('f(x)')
ax.legend()
ax.grid(True, alpha=0.3)

# Plot: Reversion to prior far from data
ax = axes[0, 1]
X_train_sparse = np.array([[0]])
y_train_sparse = np.array([2])
mean_sparse, std_sparse = gp_predict(X_train_sparse, y_train_sparse, X_test,
                                     length_scale=1.5, noise_var=0.01)

ax.fill_between(X_test.ravel(), mean_sparse - 2*std_sparse, mean_sparse + 2*std_sparse,
                alpha=0.2, color='C0')
ax.plot(X_test, mean_sparse, 'C0-', linewidth=2, label='Posterior mean')
ax.axhline(0, color='C1', linestyle='--', linewidth=2, label='Prior mean (= 0)')
ax.scatter(X_train_sparse, y_train_sparse, c='red', s=100, zorder=5,
           label='Single observation')
ax.set_title('Mean Reverts to Prior Far from Data')
ax.set_xlabel('x')
ax.set_ylabel('f(x)')
ax.legend()
ax.grid(True, alpha=0.3)

# Plot: Linearity in observations
ax = axes[1, 0]
y1 = np.array([1, 0, 1])
y2 = 2 * y1  # Double the observations
X_train_3 = np.array([[-1], [0], [1]])

mean1, _ = gp_predict(X_train_3, y1, X_test, noise_var=0.01)
mean2, _ = gp_predict(X_train_3, y2, X_test, noise_var=0.01)

ax.plot(X_test, mean1, 'C0-', linewidth=2, label='Original y')
ax.plot(X_test, mean2, 'C1--', linewidth=2, label='Doubled y')
ax.plot(X_test, 2*mean1, 'k:', linewidth=2, label='2 × Original mean')
ax.scatter(X_train_3, y1, c='C0', s=60, zorder=5)
ax.scatter(X_train_3, y2, c='C1', s=60, zorder=5)
ax.set_title('Mean is Linear in Observations')
ax.set_xlabel('x')
ax.set_ylabel('f(x)')
ax.legend()
ax.grid(True, alpha=0.3)

# Plot: Smoothness from kernel
ax = axes[1, 1]
X_train_4 = np.array([[-2], [0], [2]])
y_train_4 = np.array([0, 1, 0])

for ls, style, label in [(0.5, '-', 'ℓ=0.5 (less smooth)'),
                         (2.0, '--', 'ℓ=2.0 (smoother)')]:
    mean_ls, _ = gp_predict(X_train_4, y_train_4, X_test, length_scale=ls, noise_var=0.01)
    ax.plot(X_test, mean_ls, style, linewidth=2, label=label)

ax.scatter(X_train_4, y_train_4, c='red', s=80, zorder=5, label='Data')
ax.set_title('Smoothness Controlled by Length Scale')
ax.set_xlabel('x')
ax.set_ylabel('f(x)')
ax.legend()
ax.grid(True, alpha=0.3)

plt.suptitle('Properties of the Posterior Mean', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()
```

The posterior variance $\sigma_*^2(x)$ quantifies our uncertainty about the function value at each input. This is arguably the most valuable output of a GP—the ability to say not just "I predict 25" but "I predict 25 ± 2 with 95% confidence."
Understanding the Variance Formula:
$$\sigma_*^2 = k(x_*, x_*) - \mathbf{k}_*^T (K + \sigma_n^2 I)^{-1} \mathbf{k}_*$$
The subtracted term is a quadratic form in a positive semi-definite matrix, so it is never negative and never exceeds the prior variance $k(x_*, x_*)$. Consequently the posterior variance stays between zero and the prior variance: conditioning on observations can only reduce uncertainty, never increase it.
Key Properties of the Posterior Variance:
The fact that variance is independent of observed y values is profound. It means we can plan experiments (decide where to sample next) purely based on input locations, without needing to run the experiments first. This is the foundation of active learning and Bayesian optimization.
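A minimal sketch of this point (the helper `gp_post_std` is illustrative, using the same squared-exponential kernel as the code above): the posterior standard deviation can be computed from the input locations alone, so the observed values never even enter the function.

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def gp_post_std(X_train, X_test, length_scale=1.0, signal_var=1.0, noise_var=0.1):
    """Posterior std of a squared-exponential GP: note that y never appears."""
    def kernel(X1, X2):
        sq = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1) - 2 * X1 @ X2.T
        return signal_var * np.exp(-np.maximum(sq, 0) / (2 * length_scale**2))

    K = kernel(X_train, X_train) + noise_var * np.eye(len(X_train))
    L = cholesky(K, lower=True)
    v = solve_triangular(L, kernel(X_test, X_train).T, lower=True)
    return np.sqrt(np.maximum(signal_var - np.sum(v**2, axis=0), 0))

# Only input locations are needed, no observations:
X_train = np.array([[-1.0], [0.0], [2.0]])
X_test = np.linspace(-3, 3, 50).reshape(-1, 1)
std = gp_post_std(X_train, X_test)

# Uncertainty is low near training inputs and high far from them,
# regardless of what y-values will eventually be measured there.
print(std.min(), std.max())
```

This is exactly why an experiment can be designed (choose the next sampling location to minimize posterior variance) before any measurement is taken.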
Two Components of Predictive Uncertainty:
When making predictions, we must distinguish two sources of uncertainty:
Epistemic Uncertainty (Model Uncertainty)
Aleatoric Uncertainty (Observation Noise)
When predicting a new observation $y_* = f(x_*) + \epsilon$, the total predictive variance is:
$$\text{Var}(y_* | \mathcal{D}) = \sigma_*^2 + \sigma_n^2$$
The first term is epistemic uncertainty, which shrinks as data accumulates (reducible); the second is observation noise, which no amount of data removes (irreducible).
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.linalg import cholesky, solve_triangular

def gp_predict_full(X_train, y_train, X_test, length_scale=1.0, signal_var=1.0, noise_var=0.1):
    """GP prediction with both epistemic and total predictive variance."""
    def kernel(X1, X2):
        X1, X2 = np.atleast_2d(X1), np.atleast_2d(X2)
        sq_dist = np.sum(X1**2, 1).reshape(-1, 1) + np.sum(X2**2, 1) - 2 * X1 @ X2.T
        return signal_var * np.exp(-np.maximum(sq_dist, 0) / (2 * length_scale**2))

    K = kernel(X_train, X_train) + noise_var * np.eye(len(X_train))
    K_star = kernel(X_test, X_train)
    L = cholesky(K, lower=True)
    alpha = solve_triangular(L.T, solve_triangular(L, y_train, lower=True))
    v = solve_triangular(L, K_star.T, lower=True)
    mean = K_star @ alpha

    # Epistemic (model) uncertainty
    epistemic_var = signal_var - np.sum(v**2, axis=0)
    epistemic_var = np.maximum(epistemic_var, 0)

    # Total predictive uncertainty (for a new observation)
    total_var = epistemic_var + noise_var

    return mean, np.sqrt(epistemic_var), np.sqrt(total_var)

np.random.seed(42)

# Generate data
X_train = np.array([[-3], [-1], [0], [2]])
y_train = np.array([0.5, -0.5, 0, 1])
X_test = np.linspace(-5, 5, 200).reshape(-1, 1)

mean, epistemic_std, total_std = gp_predict_full(
    X_train, y_train, X_test, length_scale=1.0, signal_var=1.0, noise_var=0.2)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: Both types of uncertainty
ax = axes[0]
ax.fill_between(X_test.ravel(), mean - 2*total_std, mean + 2*total_std,
                alpha=0.15, color='C1', label='Total (epistemic + noise)')
ax.fill_between(X_test.ravel(), mean - 2*epistemic_std, mean + 2*epistemic_std,
                alpha=0.3, color='C0', label='Epistemic only')
ax.plot(X_test, mean, 'C0-', linewidth=2, label='Posterior mean')
ax.scatter(X_train, y_train, c='red', s=80, zorder=5, label='Data')
ax.set_title('Epistemic vs Total Uncertainty')
ax.set_xlabel('x')
ax.set_ylabel('f(x)')
ax.legend()
ax.grid(True, alpha=0.3)

# Plot 2: How uncertainty evolves as data accumulates
ax = axes[1]
X_test_local = np.linspace(-3, 3, 150).reshape(-1, 1)
X_sequential = [np.array([[-2]]),
                np.array([[-2], [0]]),
                np.array([[-2], [0], [2]]),
                np.array([[-2], [-1], [0], [1], [2]])]
y_sequential = [np.array([0]),
                np.array([0, 1]),
                np.array([0, 1, 0]),
                np.array([0, 0.5, 1, 0.5, 0])]
labels = ['n=1', 'n=2', 'n=3', 'n=5']
colors = ['C0', 'C1', 'C2', 'C3']

for X_i, y_i, label, color in zip(X_sequential, y_sequential, labels, colors):
    _, std_i, _ = gp_predict_full(X_i, y_i, X_test_local, noise_var=0.01)
    ax.plot(X_test_local, std_i, linewidth=2, color=color, label=label)
    ax.scatter(X_i, np.zeros_like(X_i), c=color, s=30, zorder=5, marker='|')

ax.set_title('Uncertainty Decreases with More Data')
ax.set_xlabel('x')
ax.set_ylabel('Posterior Std Dev')
ax.legend()
ax.grid(True, alpha=0.3)

plt.suptitle('Understanding GP Uncertainty', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

# Visualization: Epistemic vs Aleatoric
fig, ax = plt.subplots(figsize=(10, 6))
noise_vars = [0.01, 0.1, 0.5]

for nv, style in zip(noise_vars, ['-', '--', ':']):
    mean, eps_std, tot_std = gp_predict_full(X_train, y_train, X_test, noise_var=nv)
    ax.plot(X_test, tot_std, style, linewidth=2, label=f'σₙ²={nv}')

ax.set_xlabel('x')
ax.set_ylabel('Total Predictive Std Dev')
ax.set_title('Effect of Noise Variance on Predictive Uncertainty')
ax.legend()
ax.grid(True, alpha=0.3)
plt.show()
```

Given the posterior distribution, we can construct intervals that contain the true function value with specified probability.
Bayesian Credible Intervals:
For a GP, the posterior at any point is Gaussian: $$f(x_*) \mid \mathcal{D} \sim \mathcal{N}(\mu_*, \sigma_*^2)$$
A credible interval is simply a probability interval under this posterior:
Interpretation:
"There is a 95% posterior probability that the true function value lies within this interval."
This is a Bayesian statement—it quantifies our belief given the data and prior. It's distinct from frequentist confidence intervals, which make statements about procedure coverage across repeated sampling.
Frequentist confidence intervals say: "If we repeated this experiment infinitely, 95% of intervals constructed this way would contain the true value." Bayesian credible intervals say: "Given our data and prior, there's a 95% probability the true value is here." For GP regression, credible intervals are natural and more intuitive for decision-making.
| Probability | Multiplier (z) | Interval |
|---|---|---|
| 68.27% | 1.0 | $\mu \pm \sigma$ |
| 90% | 1.645 | $\mu \pm 1.645\sigma$ |
| 95% | 1.96 | $\mu \pm 1.96\sigma$ |
| 99% | 2.576 | $\mu \pm 2.576\sigma$ |
| 99.7% | 3.0 | $\mu \pm 3\sigma$ |
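The multipliers in the table come from the inverse CDF of the standard normal, $z = \Phi^{-1}\!\big(\tfrac{1+p}{2}\big)$ for central probability $p$; a quick check with scipy reproduces them:

```python
from scipy import stats

# Two-sided multiplier z for a central credible probability p
for p in [0.6827, 0.90, 0.95, 0.99, 0.9973]:
    z = stats.norm.ppf((1 + p) / 2)
    print(f"p = {p:<6}  ->  z = {z:.3f}")
```

(Note the 3σ row: $\mu \pm 3\sigma$ covers 99.73% exactly, which the table rounds to 99.7%.)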
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.linalg import cholesky, solve_triangular
from scipy import stats

def gp_predict_with_intervals(X_train, y_train, X_test, length_scale=1.0,
                              signal_var=1.0, noise_var=0.1,
                              confidence_levels=[0.68, 0.95, 0.99]):
    """GP prediction with multiple credible intervals."""
    def kernel(X1, X2):
        X1, X2 = np.atleast_2d(X1), np.atleast_2d(X2)
        sq_dist = np.sum(X1**2, 1).reshape(-1, 1) + np.sum(X2**2, 1) - 2 * X1 @ X2.T
        return signal_var * np.exp(-np.maximum(sq_dist, 0) / (2 * length_scale**2))

    K = kernel(X_train, X_train) + noise_var * np.eye(len(X_train))
    K_star = kernel(X_test, X_train)
    L = cholesky(K, lower=True)
    alpha = solve_triangular(L.T, solve_triangular(L, y_train, lower=True))
    v = solve_triangular(L, K_star.T, lower=True)
    mean = K_star @ alpha
    var = signal_var - np.sum(v**2, axis=0)
    std = np.sqrt(np.maximum(var, 0))

    # Compute intervals for each confidence level
    intervals = {}
    for level in confidence_levels:
        z = stats.norm.ppf(0.5 + level / 2)  # Two-sided z-score
        intervals[level] = {
            'lower': mean - z * std,
            'upper': mean + z * std,
            'z': z
        }

    return mean, std, intervals

np.random.seed(42)

# Generate synthetic data
X_train = np.linspace(-3, 3, 8).reshape(-1, 1)
y_train = np.sin(X_train.ravel()) + 0.1 * np.random.randn(len(X_train))
X_test = np.linspace(-4, 4, 200).reshape(-1, 1)

mean, std, intervals = gp_predict_with_intervals(
    X_train, y_train, X_test, length_scale=1.0, noise_var=0.01)

# Visualization: Nested credible intervals
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: Nested bands
ax = axes[0]
colors = ['#1f77b4', '#1f77b4', '#1f77b4']
alphas = [0.1, 0.2, 0.35]

for level, color, alpha in zip([0.99, 0.95, 0.68], colors, alphas):
    interval = intervals[level]
    ax.fill_between(X_test.ravel(), interval['lower'], interval['upper'],
                    alpha=alpha, color=color,
                    label=f'{int(level*100)}% CI (±{interval["z"]:.2f}σ)')

ax.plot(X_test, mean, 'C0-', linewidth=2, label='Mean')
ax.plot(X_test, np.sin(X_test.ravel()), 'k--', linewidth=1.5,
        label='True function', alpha=0.7)
ax.scatter(X_train, y_train, c='red', s=60, zorder=5, label='Data')
ax.set_title('Nested Credible Intervals')
ax.set_xlabel('x')
ax.set_ylabel('f(x)')
ax.legend(loc='upper right')
ax.grid(True, alpha=0.3)

# Plot 2: Coverage analysis
ax = axes[1]

# Generate many test points and check coverage
n_test = 100
X_test_coverage = np.linspace(-3, 3, n_test).reshape(-1, 1)
f_true = np.sin(X_test_coverage.ravel())  # True function values

mean_cov, std_cov, _ = gp_predict_with_intervals(
    X_train, y_train, X_test_coverage, noise_var=0.01)

# Check what percentage of true values fall within different intervals
theory_probs = []
empirical_probs = []

for z_mult in np.linspace(0.1, 3.5, 50):
    lower = mean_cov - z_mult * std_cov
    upper = mean_cov + z_mult * std_cov

    # Empirical coverage
    inside = np.sum((f_true >= lower) & (f_true <= upper))
    empirical_probs.append(inside / n_test)

    # Theoretical coverage
    theory_probs.append(2 * stats.norm.cdf(z_mult) - 1)

ax.plot(theory_probs, empirical_probs, 'o-', markersize=3, linewidth=1.5)
ax.plot([0, 1], [0, 1], 'k--', linewidth=1.5, label='Perfect calibration')
ax.set_xlabel('Theoretical Coverage')
ax.set_ylabel('Empirical Coverage')
ax.set_title('Calibration Plot: Are Credible Intervals Calibrated?')
ax.legend()
ax.grid(True, alpha=0.3)
ax.set_aspect('equal')

plt.suptitle('Credible Intervals and Calibration', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()
```

While mean and variance summaries are useful, the full predictive distribution contains richer information.
For the Latent Function $f(x_*)$:
$$p(f_* \mid \mathcal{D}, x_*) = \mathcal{N}(\mu_*, \sigma_*^2)$$
This is the epistemic predictive distribution—uncertainty about the true function.
For a New Observation $y_*$:
If we'll observe $y_* = f(x_*) + \epsilon$ with $\epsilon \sim \mathcal{N}(0, \sigma_n^2)$:
$$p(y_* \mid \mathcal{D}, x_*) = \mathcal{N}(\mu_*, \sigma_*^2 + \sigma_n^2)$$
This is the total predictive distribution, accounting for both model and observation uncertainty.
Uses of the Full Distribution:
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
from scipy.linalg import cholesky, solve_triangular

def gp_predict(X_train, y_train, X_test, length_scale=1.0, signal_var=1.0, noise_var=0.1):
    def kernel(X1, X2):
        X1, X2 = np.atleast_2d(X1), np.atleast_2d(X2)
        sq_dist = np.sum(X1**2, 1).reshape(-1, 1) + np.sum(X2**2, 1) - 2 * X1 @ X2.T
        return signal_var * np.exp(-np.maximum(sq_dist, 0) / (2 * length_scale**2))

    K = kernel(X_train, X_train) + noise_var * np.eye(len(X_train))
    K_star = kernel(X_test, X_train)
    L = cholesky(K, lower=True)
    alpha = solve_triangular(L.T, solve_triangular(L, y_train, lower=True))
    v = solve_triangular(L, K_star.T, lower=True)
    mean = K_star @ alpha
    var = signal_var - np.sum(v**2, axis=0)
    return mean, np.sqrt(np.maximum(var, 0))

np.random.seed(42)

# Setup
X_train = np.array([[-2], [0], [1]])
y_train = np.array([0.5, 0, 1])
X_test = np.linspace(-3, 3, 150).reshape(-1, 1)

mean, std = gp_predict(X_train, y_train, X_test, noise_var=0.01)

# Focus on a specific test point
x_query = np.array([[1.5]])
mean_q, std_q = gp_predict(X_train, y_train, x_query, noise_var=0.01)

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Plot 1: Full GP with highlighted query point
ax = axes[0]
ax.fill_between(X_test.ravel(), mean - 2*std, mean + 2*std, alpha=0.2, color='C0')
ax.plot(X_test, mean, 'C0-', linewidth=2)
ax.scatter(X_train, y_train, c='red', s=80, zorder=5)
ax.axvline(x_query.item(), color='green', linewidth=2, linestyle='--')
ax.scatter([x_query.item()], [mean_q.item()], c='green', s=100, zorder=6, marker='D')
ax.set_title('GP Prediction with Query Point')
ax.set_xlabel('x')
ax.set_ylabel('f(x)')
ax.grid(True, alpha=0.3)

# Plot 2: Predictive distribution at query point
ax = axes[1]
f_vals = np.linspace(mean_q.item() - 4*std_q.item(), mean_q.item() + 4*std_q.item(), 100)
pdf = stats.norm.pdf(f_vals, mean_q.item(), std_q.item())

ax.plot(f_vals, pdf, 'C0-', linewidth=2)
ax.fill_between(f_vals, 0, pdf, alpha=0.3, color='C0')
ax.axvline(mean_q.item(), color='red', linewidth=2, linestyle='--', label='Mean')
ax.axvline(0, color='gray', linewidth=1, linestyle=':', label='Threshold at 0')

# Shade probability f > 0
f_positive = f_vals[f_vals >= 0]
pdf_positive = stats.norm.pdf(f_positive, mean_q.item(), std_q.item())
ax.fill_between(f_positive, 0, pdf_positive, alpha=0.5, color='C1',
                label=f'P(f > 0) = {1 - stats.norm.cdf(0, mean_q.item(), std_q.item()):.3f}')

ax.set_title(f'Predictive Distribution at x = {x_query.item():.1f}')
ax.set_xlabel('f(x)')
ax.set_ylabel('Density')
ax.legend()
ax.grid(True, alpha=0.3)

# Plot 3: Probability queries across input space
ax = axes[2]
threshold = 0.5
prob_above = 1 - stats.norm.cdf(threshold, mean, std)

ax.fill_between(X_test.ravel(), 0, prob_above, alpha=0.3, color='C2')
ax.plot(X_test, prob_above, 'C2-', linewidth=2)
ax.axhline(0.5, color='gray', linestyle='--', alpha=0.5)
ax.scatter(X_train, np.zeros_like(y_train) + 0.02, c='red', s=60, marker='^')

ax.set_title(f'P(f(x) > {threshold}) Across Input Space')
ax.set_xlabel('x')
ax.set_ylabel('Probability')
ax.set_ylim(0, 1)
ax.grid(True, alpha=0.3)

plt.suptitle('Using the Full Predictive Distribution', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()
```

Effective visualization of GP predictions is crucial for interpretation and communication. Here are best practices:
Standard Elements:
Color and Transparency:
Avoid Common Mistakes:
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.linalg import cholesky, solve_triangular

def gp_predict(X_train, y_train, X_test, length_scale=1.0, signal_var=1.0, noise_var=0.1):
    def kernel(X1, X2):
        X1, X2 = np.atleast_2d(X1), np.atleast_2d(X2)
        sq_dist = np.sum(X1**2, 1).reshape(-1, 1) + np.sum(X2**2, 1) - 2 * X1 @ X2.T
        return signal_var * np.exp(-np.maximum(sq_dist, 0) / (2 * length_scale**2))

    K = kernel(X_train, X_train) + noise_var * np.eye(len(X_train))
    K_star = kernel(X_test, X_train)
    K_ss = kernel(X_test, X_test)
    L = cholesky(K, lower=True)
    alpha = solve_triangular(L.T, solve_triangular(L, y_train, lower=True))
    v = solve_triangular(L, K_star.T, lower=True)
    mean = K_star @ alpha
    cov = K_ss - v.T @ v + 1e-8 * np.eye(len(X_test))
    return mean, cov

def excellent_gp_plot(X_train, y_train, X_test, mean, std, samples=None,
                      true_func=None, title='GP Regression', noise_var=0.1):
    """Create a publication-quality GP visualization."""
    fig, ax = plt.subplots(figsize=(10, 6))

    # Style settings
    plt.style.use('default')
    ax.set_facecolor('#fafafa')

    # Outer band: 99% CI
    ax.fill_between(X_test.ravel(), mean - 2.58*std, mean + 2.58*std,
                    alpha=0.1, color='#1f77b4', edgecolor='none',
                    label='99% credible interval')

    # Middle band: 95% CI
    ax.fill_between(X_test.ravel(), mean - 1.96*std, mean + 1.96*std,
                    alpha=0.2, color='#1f77b4', edgecolor='none',
                    label='95% credible interval')

    # Inner band: 68% CI
    ax.fill_between(X_test.ravel(), mean - std, mean + std,
                    alpha=0.35, color='#1f77b4', edgecolor='none',
                    label='68% credible interval')

    # Posterior samples (if provided)
    if samples is not None:
        for i, sample in enumerate(samples.T[:5]):
            ax.plot(X_test, sample, color='#1f77b4', alpha=0.3, linewidth=1,
                    zorder=2, label='Posterior samples' if i == 0 else None)

    # Posterior mean
    ax.plot(X_test, mean, color='#1f77b4', linewidth=2.5,
            label='Posterior mean', zorder=3)

    # True function (if provided)
    if true_func is not None:
        ax.plot(X_test, true_func, 'k--', linewidth=2, alpha=0.7,
                label='True function', zorder=4)

    # Training data
    ax.scatter(X_train, y_train, c='#d62728', s=80, zorder=5,
               edgecolors='white', linewidths=1.5, label='Training observations')

    # Formatting
    ax.set_xlabel('Input, x', fontsize=12)
    ax.set_ylabel('Output, f(x)', fontsize=12)
    ax.set_title(title, fontsize=14, fontweight='bold')
    ax.legend(loc='upper left', framealpha=0.9)
    ax.grid(True, alpha=0.3, linestyle='--')

    # Add annotation about noise
    ax.text(0.98, 0.02, f'Observation noise: σₙ² = {noise_var}',
            transform=ax.transAxes, ha='right', va='bottom',
            fontsize=9, color='gray')

    plt.tight_layout()
    return fig, ax

# Generate example
np.random.seed(42)

# True function
def true_func(x):
    return np.sin(x) + 0.3 * np.sin(3*x)

# Generate training data
X_train = np.random.uniform(-4, 4, 12).reshape(-1, 1)
X_train = np.sort(X_train, axis=0)
y_train = true_func(X_train.ravel()) + 0.15 * np.random.randn(len(X_train))

X_test = np.linspace(-5, 5, 200).reshape(-1, 1)

mean, cov = gp_predict(X_train, y_train, X_test, length_scale=1.0, noise_var=0.02)
std = np.sqrt(np.diag(cov))

# Generate posterior samples
L = cholesky(cov, lower=True)
samples = mean[:, np.newaxis] + L @ np.random.randn(len(X_test), 10)

# Create excellent visualization
fig, ax = excellent_gp_plot(
    X_train.ravel(), y_train, X_test.ravel(), mean, std, samples,
    true_func=true_func(X_test.ravel()),
    title='Gaussian Process Regression with Uncertainty',
    noise_var=0.02)
plt.show()
```

The ultimate purpose of predictions is often to support decisions. GP predictions, with their uncertainty quantification, are particularly well-suited for decision-making under uncertainty.
Key Quantities for Decisions:
Probability of Improvement (PI) $$PI(x) = P(f(x) > f^+) = \Phi\left(\frac{\mu(x) - f^+}{\sigma(x)}\right)$$ where $f^+$ is the best observed value and $\Phi$ is the standard normal CDF.
Expected Improvement (EI) $$EI(x) = \mathbb{E}[\max(f(x) - f^+, 0)]$$ $$= (\mu(x) - f^+)\Phi(Z) + \sigma(x)\phi(Z)$$ where $Z = \frac{\mu(x) - f^+}{\sigma(x)}$ and $\phi$ is the standard normal PDF.
Upper Confidence Bound (UCB) $$UCB(x) = \mu(x) + \kappa \cdot \sigma(x)$$ Balances exploitation (high $\mu$) vs exploration (high $\sigma$).
These form the basis of Bayesian Optimization—sequential decision-making for expensive black-box function optimization.
Should we exploit what we know (query where the mean is high) or explore where we're uncertain (query where the variance is high)? GP uncertainty naturally enables this tradeoff. Acquisition functions like EI and UCB automatically balance both considerations.
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
from scipy.linalg import cholesky, solve_triangular

def gp_predict(X_train, y_train, X_test, length_scale=1.0, signal_var=1.0, noise_var=0.1):
    def kernel(X1, X2):
        X1, X2 = np.atleast_2d(X1), np.atleast_2d(X2)
        sq_dist = np.sum(X1**2, 1).reshape(-1, 1) + np.sum(X2**2, 1) - 2 * X1 @ X2.T
        return signal_var * np.exp(-np.maximum(sq_dist, 0) / (2 * length_scale**2))

    K = kernel(X_train, X_train) + noise_var * np.eye(len(X_train))
    K_star = kernel(X_test, X_train)
    L = cholesky(K, lower=True)
    alpha = solve_triangular(L.T, solve_triangular(L, y_train, lower=True))
    v = solve_triangular(L, K_star.T, lower=True)
    mean = K_star @ alpha
    var = signal_var - np.sum(v**2, axis=0)
    return mean, np.sqrt(np.maximum(var, 0))

def acquisition_functions(mean, std, y_best, kappa=2.0):
    """Compute common acquisition functions for Bayesian optimization."""
    # Standardized improvement
    with np.errstate(divide='ignore', invalid='ignore'):
        Z = (mean - y_best) / std
    Z = np.where(std > 0, Z, 0)

    # Probability of Improvement
    PI = stats.norm.cdf(Z)

    # Expected Improvement
    EI = (mean - y_best) * stats.norm.cdf(Z) + std * stats.norm.pdf(Z)
    EI = np.where(std > 0, EI, 0)

    # Upper Confidence Bound
    UCB = mean + kappa * std

    return PI, EI, UCB

np.random.seed(42)

# Simulate optimization scenario
# True function (unknown to optimizer)
def objective(x):
    return -(x - 0.5)**2 * np.sin(3*x) + 0.5

# Current observations
X_train = np.array([[-3], [-1], [1], [3]])
y_train = objective(X_train.ravel()) + 0.05 * np.random.randn(len(X_train))

# Candidate points
X_test = np.linspace(-4, 4, 200).reshape(-1, 1)

# GP prediction
mean, std = gp_predict(X_train, y_train, X_test, length_scale=1.0, noise_var=0.0025)

# Best observed value
y_best = np.max(y_train)

# Compute acquisitions
PI, EI, UCB = acquisition_functions(mean, std, y_best)

# Visualization
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Plot 1: GP prediction
ax = axes[0, 0]
ax.fill_between(X_test.ravel(), mean - 2*std, mean + 2*std, alpha=0.2, color='C0')
ax.plot(X_test, mean, 'C0-', linewidth=2, label='GP Mean')
ax.plot(X_test, objective(X_test.ravel()), 'k--', linewidth=1.5,
        label='True objective', alpha=0.7)
ax.scatter(X_train, y_train, c='red', s=80, zorder=5, label='Observations')
ax.axhline(y_best, color='green', linestyle=':', label=f'Best so far: {y_best:.2f}')
ax.set_title('GP Model of Objective Function')
ax.set_xlabel('x')
ax.set_ylabel('f(x)')
ax.legend()
ax.grid(True, alpha=0.3)

# Plot 2: Probability of Improvement
ax = axes[0, 1]
ax.fill_between(X_test.ravel(), 0, PI, alpha=0.3, color='C1')
ax.plot(X_test, PI, 'C1-', linewidth=2)
best_PI_idx = np.argmax(PI)
ax.axvline(X_test[best_PI_idx], color='C1', linestyle='--')
ax.scatter([X_test[best_PI_idx]], [PI[best_PI_idx]], c='C1', s=100, zorder=5, marker='*')
ax.set_title('Probability of Improvement (PI)')
ax.set_xlabel('x')
ax.set_ylabel('PI(x)')
ax.grid(True, alpha=0.3)

# Plot 3: Expected Improvement
ax = axes[1, 0]
ax.fill_between(X_test.ravel(), 0, EI, alpha=0.3, color='C2')
ax.plot(X_test, EI, 'C2-', linewidth=2)
best_EI_idx = np.argmax(EI)
ax.axvline(X_test[best_EI_idx], color='C2', linestyle='--')
ax.scatter([X_test[best_EI_idx]], [EI[best_EI_idx]], c='C2', s=100, zorder=5, marker='*')
ax.set_title('Expected Improvement (EI)')
ax.set_xlabel('x')
ax.set_ylabel('EI(x)')
ax.grid(True, alpha=0.3)

# Plot 4: Upper Confidence Bound
ax = axes[1, 1]
ax.plot(X_test, mean, 'C0-', linewidth=1.5, label='Mean', alpha=0.7)
ax.plot(X_test, UCB, 'C3-', linewidth=2, label='UCB (κ=2)')
best_UCB_idx = np.argmax(UCB)
ax.axvline(X_test[best_UCB_idx], color='C3', linestyle='--')
ax.scatter([X_test[best_UCB_idx]], [UCB[best_UCB_idx]], c='C3', s=100, zorder=5, marker='*')
ax.fill_between(X_test.ravel(), mean, UCB, alpha=0.2, color='C3')
ax.set_title('Upper Confidence Bound (UCB)')
ax.set_xlabel('x')
ax.set_ylabel('UCB(x)')
ax.legend()
ax.grid(True, alpha=0.3)

plt.suptitle('Acquisition Functions for Decision-Making', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

# Summary of recommendations
print("Next point to sample:")
print(f"  According to PI:  x = {X_test[best_PI_idx].item():.3f}")
print(f"  According to EI:  x = {X_test[best_EI_idx].item():.3f}")
print(f"  According to UCB: x = {X_test[best_UCB_idx].item():.3f}")
```

We've explored how to interpret and use GP predictions effectively, moving beyond point estimates to full probabilistic inference.
What's Next:
We'll conclude this module by comparing Gaussian Processes with Kernel Ridge Regression, understanding their connections and differences, and when to choose one over the other.
You now understand how to interpret GP predictions, visualize uncertainty, and use probabilistic outputs for decision-making. This practical knowledge transforms GPs from mathematical abstractions into powerful tools for real-world prediction and optimization.