By now, you've learned that Gaussian Processes provide probabilistic predictions with uncertainty quantification, while Kernel Ridge Regression (from Module 2) gives us regularized predictions without explicit probabilistic interpretation. Yet these two methods are intimately connected—in some cases, they yield identical predictions.
Understanding this connection is more than an academic exercise. It reveals why the two methods produce identical mean predictions, what the Bayesian framing adds on top of the regularization view, and when that extra machinery is worth the cost.
This page unifies your understanding of kernel methods, showing how the same mathematical core manifests in different frameworks with different strengths.
By the end of this page, you will understand the precise mathematical relationship between GPs and Kernel Ridge Regression, how different perspectives (Bayesian vs regularization) lead to the same or different solutions, when uncertainty quantification justifies the GP approach, and practical guidelines for choosing between approaches in real applications.
Let's establish both methods on equal footing before comparing them.
Kernel Ridge Regression (revisited):
Given data $(X, \mathbf{y})$ and regularization parameter $\lambda$, Kernel Ridge Regression solves:
$$\min_{\boldsymbol{\alpha}} \|\mathbf{y} - K\boldsymbol{\alpha}\|^2 + \lambda\, \boldsymbol{\alpha}^T K \boldsymbol{\alpha}$$
The solution is: $$\boldsymbol{\alpha}^* = (K + \lambda I)^{-1}\mathbf{y}$$
Predictions at a test point $x_*$: $$\hat{f}(x_*) = \mathbf{k}_*^T \boldsymbol{\alpha}^* = \mathbf{k}_*^T (K + \lambda I)^{-1}\mathbf{y}$$ where $\mathbf{k}_*$ is the vector of kernel evaluations between $x_*$ and the training inputs.
Gaussian Process Regression (from this module):
Given data $(X, \mathbf{y})$, noise variance $\sigma_n^2$, and kernel $k$, the GP posterior mean is: $$\mu_*(x_*) = \mathbf{k}_*^T (K + \sigma_n^2 I)^{-1}\mathbf{y}$$
(Assuming zero prior mean for simplicity.)
Compare the two prediction formulas. They are identical if we set $\lambda = \sigma_n^2$. The regularization parameter in Kernel Ridge corresponds exactly to the noise variance in GP regression!
| Aspect | Kernel Ridge Regression | Gaussian Process |
|---|---|---|
| Framework | Regularized optimization | Bayesian inference |
| Objective | Minimize regularized loss | Compute posterior distribution |
| Key parameter | $\lambda$ (regularization strength) | $\sigma_n^2$ (noise variance) |
| Output | Point prediction $\hat{f}(x_*)$ | Distribution $p(f_* \mid \mathcal{D})$ |
| Mean prediction | $\mathbf{k}_*^T(K + \lambda I)^{-1}\mathbf{y}$ | $\mathbf{k}_*^T(K + \sigma_n^2 I)^{-1}\mathbf{y}$ |
| Uncertainty | Not provided | Posterior variance $\sigma_*^2$ |
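To see the correspondence numerically before the longer comparison later on this page, here is a minimal NumPy sketch; the RBF kernel, data, and sizes are illustrative choices, not tied to any dataset in the text:

```python
# Minimal sketch: with lambda = sigma_n^2, the KRR prediction and the GP posterior
# mean are the same formula. Kernel, data, and sizes here are illustrative only.
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(12, 1))
y = np.sin(X).ravel()
X_star = np.linspace(-3, 3, 50).reshape(-1, 1)

def rbf(A, B, ell=1.0):
    """RBF kernel matrix between the rows of A and B."""
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1) - 2 * A @ B.T
    return np.exp(-d2 / (2 * ell**2))

K, K_star = rbf(X, X), rbf(X_star, X)
lam = 0.1          # KRR regularization strength
sigma_n2 = lam     # GP noise variance set equal to lambda

# Kernel Ridge: alpha = (K + lambda I)^{-1} y, prediction k_*^T alpha
krr_pred = K_star @ np.linalg.solve(K + lam * np.eye(len(X)), y)

# GP posterior mean via Cholesky: k_*^T (K + sigma_n^2 I)^{-1} y
L = np.linalg.cholesky(K + sigma_n2 * np.eye(len(X)))
gp_mean = K_star @ np.linalg.solve(L.T, np.linalg.solve(L, y))

print(np.max(np.abs(krr_pred - gp_mean)))  # ~1e-15: identical up to round-off
```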
The connection between Kernel Ridge and GPs runs deeper than superficial similarity. There's a formal equivalence:
Theorem (Regularization = Bayesian Inference):
Kernel Ridge Regression with regularization parameter $\lambda$ is equivalent to the posterior mean of Gaussian Process regression with the same kernel $k$ as the prior covariance and noise variance $\sigma_n^2 = \lambda$.
In other words, minimizing squared loss plus an RKHS norm penalty is equivalent to computing the MAP (Maximum A Posteriori) estimate under a GP prior.
Proof Sketch:
Consider the GP log-posterior: $$\log p(f | \mathbf{y}) = \log p(\mathbf{y} | f) + \log p(f) - \log p(\mathbf{y})$$
The log-likelihood (Gaussian noise): $$\log p(\mathbf{y} | f) = -\frac{1}{2\sigma_n^2}\|\mathbf{y} - f(X)\|^2 + \text{const}$$
The log-prior (GP with RKHS norm): $$\log p(f) = -\frac{1}{2}\|f\|_{\mathcal{H}_k}^2 + \text{const}$$
where $\|f\|_{\mathcal{H}_k}^2 = f(X)^T K^{-1} f(X)$ is the RKHS norm evaluated at the training inputs.
Maximizing the log-posterior is therefore the same as minimizing: $$\frac{1}{2\sigma_n^2}\|\mathbf{y} - f(X)\|^2 + \frac{1}{2}f(X)^T K^{-1} f(X)$$
Substituting $f(X) = K\boldsymbol{\alpha}$ (the form guaranteed by the Representer Theorem) turns $f(X)^T K^{-1} f(X)$ into $\boldsymbol{\alpha}^T K \boldsymbol{\alpha}$; multiplying through by $2\sigma_n^2$ then yields exactly the Kernel Ridge objective with $\lambda = \sigma_n^2$.
Kernel Ridge gives the MAP (mode) of the GP posterior. The full GP provides the entire posterior distribution. Both have the same mean, but only the GP gives variance. This distinction matters when uncertainty is important.
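For reference, the quantity Kernel Ridge lacks is the standard GP posterior variance at a test point $x_*$ (under the same zero-mean prior and Gaussian noise model used above):

$$\sigma_*^2(x_*) = k(x_*, x_*) - \mathbf{k}_*^T (K + \sigma_n^2 I)^{-1} \mathbf{k}_*$$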
The Representer Theorem Connection:
Recall the Representer Theorem from Module 2: any minimizer of a regularized risk in an RKHS lies in the span of the kernel functions at training points:
$$f^* = \sum_{i=1}^n \alpha_i k(x_i, \cdot)$$
For GPs, the posterior mean has exactly this form: $$\mu_*(\cdot) = \sum_{i=1}^n \alpha_i k(x_i, \cdot)$$
where $\boldsymbol{\alpha} = (K + \sigma_n^2 I)^{-1}\mathbf{y}$.
The same representational structure arises from both regularization theory and Bayesian inference.
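As a minimal sketch with illustrative data, the two readings of this expansion agree numerically: evaluating $\mathbf{k}_*^T \boldsymbol{\alpha}$ gives the same number as summing $\alpha_i\, k(x_i, x_*)$ over the training points.

```python
# Minimal sketch (illustrative data): the GP posterior mean at a new point equals a
# weighted sum of kernel functions centered at the training inputs.
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(-2, 2, size=(8, 1))
y = np.cos(X).ravel()
ell, sigma_n2 = 0.8, 0.05

def rbf(A, B):
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1) - 2 * A @ B.T
    return np.exp(-d2 / (2 * ell**2))

alpha = np.linalg.solve(rbf(X, X) + sigma_n2 * np.eye(len(X)), y)  # (K + sigma_n^2 I)^{-1} y

x_star = np.array([[0.3]])
vector_form = (rbf(x_star, X) @ alpha).item()               # k_*^T alpha
sum_form = sum(a_i * rbf(x_star, X[i:i + 1]).item()         # sum_i alpha_i k(x_i, x_*)
               for i, a_i in enumerate(alpha))
print(vector_form, sum_form)  # same number, two readings of the same expansion
```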
Despite identical mean predictions, GPs offer several advantages over Kernel Ridge Regression:
1. Principled Uncertainty Quantification:
GPs provide posterior variance at every prediction point. This enables calibrated confidence intervals, risk-aware decision making, and the acquisition functions behind Bayesian optimization and active learning.
Kernel Ridge gives no native uncertainty—you can jackknife or bootstrap, but these are post-hoc approximations.
2. Principled Hyperparameter Selection:
GPs provide the marginal likelihood (evidence): $$p(\mathbf{y} | X, \theta) = \int p(\mathbf{y} | f) p(f | \theta) df$$
where $\theta$ are hyperparameters (kernel parameters, noise variance). This can be computed in closed form and differentiated for gradient-based optimization.
Kernel Ridge typically uses cross-validation for hyperparameter selection—more expensive and potentially higher variance.
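As a sketch of what this looks like in practice (assuming scikit-learn is available; the data, kernel choice, and starting values below are illustrative), `GaussianProcessRegressor` maximizes the log marginal likelihood with gradient-based optimization when `fit` is called:

```python
# Minimal sketch: hyperparameters chosen by maximizing the log marginal likelihood.
# The data, kernel, and starting values are illustrative assumptions.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)
X = rng.uniform(-4, 4, size=(25, 1))
y = np.sin(X).ravel() + 0.2 * rng.standard_normal(25)

# Deliberately poor starting values; fit() optimizes length scale and noise level
# by maximizing the log marginal likelihood with gradient-based search (L-BFGS-B).
kernel = RBF(length_scale=5.0) + WhiteKernel(noise_level=1.0)
gpr = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=5).fit(X, y)

print("Optimized kernel:", gpr.kernel_)
print("Log marginal likelihood at optimum:", gpr.log_marginal_likelihood_value_)
```

The longer script below builds both predictors directly in NumPy, verifies that the mean predictions coincide, and sweeps the marginal likelihood over length scales.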
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.linalg import cholesky, solve_triangular

def se_kernel_matrix(X1, X2, length_scale=1.0, signal_var=1.0):
    """Squared Exponential kernel matrix."""
    X1, X2 = np.atleast_2d(X1), np.atleast_2d(X2)
    sq_dist = np.sum(X1**2, 1).reshape(-1,1) + np.sum(X2**2, 1) - 2*X1@X2.T
    return signal_var * np.exp(-np.maximum(sq_dist, 0) / (2*length_scale**2))

def kernel_ridge_predict(X_train, y_train, X_test, lmbda=0.1, **kernel_params):
    """Kernel Ridge Regression prediction (point estimate only)."""
    K = se_kernel_matrix(X_train, X_train, **kernel_params)
    K_star = se_kernel_matrix(X_test, X_train, **kernel_params)
    alpha = np.linalg.solve(K + lmbda * np.eye(len(K)), y_train)
    return K_star @ alpha

def gp_predict(X_train, y_train, X_test, noise_var=0.1, **kernel_params):
    """GP prediction with mean and variance."""
    K = se_kernel_matrix(X_train, X_train, **kernel_params)
    K_star = se_kernel_matrix(X_test, X_train, **kernel_params)
    signal_var = kernel_params.get('signal_var', 1.0)
    L = cholesky(K + noise_var * np.eye(len(K)), lower=True)
    alpha = solve_triangular(L.T, solve_triangular(L, y_train, lower=True))
    v = solve_triangular(L, K_star.T, lower=True)
    mean = K_star @ alpha
    var = signal_var - np.sum(v**2, axis=0)
    return mean, np.sqrt(np.maximum(var, 0))

def gp_log_marginal_likelihood(X_train, y_train, noise_var=0.1, **kernel_params):
    """Compute log marginal likelihood for hyperparameter tuning."""
    K = se_kernel_matrix(X_train, X_train, **kernel_params)
    K_noisy = K + noise_var * np.eye(len(K))
    L = cholesky(K_noisy, lower=True)
    alpha = solve_triangular(L.T, solve_triangular(L, y_train, lower=True))
    # log|K| = 2 * sum(log(diag(L)))
    log_det = 2 * np.sum(np.log(np.diag(L)))
    # Log marginal likelihood
    n = len(y_train)
    lml = -0.5 * y_train @ alpha - 0.5 * log_det - 0.5 * n * np.log(2*np.pi)
    return lml

np.random.seed(42)

# Generate data
n = 15
X_train = np.random.uniform(-4, 4, n).reshape(-1, 1)
true_func = lambda x: np.sin(x) + 0.3*np.sin(3*x)
y_train = true_func(X_train.ravel()) + 0.15 * np.random.randn(n)

X_test = np.linspace(-5, 5, 200).reshape(-1, 1)

# Compare predictions with same regularization/noise
noise_var = 0.02
lmbda = noise_var  # Set equal for comparison

krr_pred = kernel_ridge_predict(X_train, y_train, X_test, lmbda=lmbda, length_scale=1.0)
gp_mean, gp_std = gp_predict(X_train, y_train, X_test, noise_var=noise_var, length_scale=1.0)

# Visualization
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Plot 1: KRR prediction (no uncertainty)
ax = axes[0, 0]
ax.plot(X_test, krr_pred, 'C0-', linewidth=2, label='KRR Prediction')
ax.plot(X_test, true_func(X_test.ravel()), 'k--', linewidth=1.5, label='True function', alpha=0.7)
ax.scatter(X_train, y_train, c='red', s=60, zorder=5, label='Data')
ax.set_title(f'Kernel Ridge Regression (λ = {lmbda})')
ax.set_xlabel('x')
ax.set_ylabel('f(x)')
ax.legend()
ax.grid(True, alpha=0.3)

# Plot 2: GP prediction (with uncertainty)
ax = axes[0, 1]
ax.fill_between(X_test.ravel(), gp_mean-2*gp_std, gp_mean+2*gp_std,
                alpha=0.2, color='C0', label='95% CI')
ax.plot(X_test, gp_mean, 'C0-', linewidth=2, label='GP Mean')
ax.plot(X_test, true_func(X_test.ravel()), 'k--', linewidth=1.5, label='True function', alpha=0.7)
ax.scatter(X_train, y_train, c='red', s=60, zorder=5, label='Data')
ax.set_title(f'GP Regression (σₙ² = {noise_var})')
ax.set_xlabel('x')
ax.set_ylabel('f(x)')
ax.legend()
ax.grid(True, alpha=0.3)

# Plot 3: Verify mean equivalence
ax = axes[1, 0]
ax.plot(X_test, krr_pred, 'C0-', linewidth=2, label='KRR')
ax.plot(X_test, gp_mean, 'C1--', linewidth=2, label='GP Mean')
ax.set_title('Mean Predictions are Identical!')
ax.set_xlabel('x')
ax.set_ylabel('f(x)')
ax.legend()
ax.grid(True, alpha=0.3)

# Compute difference
diff = np.abs(krr_pred - gp_mean)
print(f"Max difference between KRR and GP mean: {diff.max():.2e}")

# Plot 4: GP marginal likelihood for hyperparameter selection
ax = axes[1, 1]
length_scales = np.logspace(-1, 1, 50)
lmls = [gp_log_marginal_likelihood(X_train, y_train, noise_var=0.02, length_scale=ls)
        for ls in length_scales]

ax.plot(length_scales, lmls, 'C2-', linewidth=2)
best_ls = length_scales[np.argmax(lmls)]
ax.axvline(best_ls, color='C2', linestyle='--', alpha=0.7)
ax.scatter([best_ls], [max(lmls)], c='C2', s=100, zorder=5, marker='*')
ax.set_xscale('log')
ax.set_title(f'Log Marginal Likelihood (GP only)\nOptimal ℓ = {best_ls:.2f}')
ax.set_xlabel('Length Scale')
ax.set_ylabel('Log Marginal Likelihood')
ax.grid(True, alpha=0.3)

plt.suptitle('Kernel Ridge vs Gaussian Process Comparison', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()
```

Despite GP advantages, Kernel Ridge Regression has its place. Here's when it might be preferred:
1. Simplicity:
Kernel Ridge is conceptually simpler: minimize a regularized loss. No priors, no posteriors, no noise modeling. For teams unfamiliar with Bayesian methods, it's more accessible.
2. When Uncertainty Isn't Needed:
If you only need point predictions and uncertainty won't influence decisions, Kernel Ridge does the job with less overhead. Many applications (ranking, feature extraction, some classification) don't require variance estimates.
3. Computational Equivalence (for mean):
Since the mean predictions are identical, if you're not using the variance, you're doing exactly the same computation. Kernel Ridge implementations may be marginally faster (no variance calculation).
4. When Using Non-Gaussian Likelihoods:
For classification or other non-Gaussian problems, kernel methods with other losses (e.g., the hinge loss, which yields the SVM) have well-developed theory. GP classification requires approximations (Laplace, EP, variational).
In practice, if you're using scikit-learn's KernelRidge or GaussianProcessRegressor, the mean predictions will be nearly identical (for the same hyperparameters). The choice often comes down to: do you need uncertainty? If yes, use GP. If no, either works.
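A minimal sketch of that check is shown below (assuming scikit-learn; the data and hyperparameters are illustrative). Setting KernelRidge's `alpha` to the GP's noise variance and fixing the same RBF length scale makes the mean predictions agree to machine precision:

```python
# Minimal sketch: KernelRidge and GaussianProcessRegressor give the same mean when
# lambda = noise variance and the kernels match. Data and settings are illustrative.
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(30, 1))
y = np.sin(X).ravel() + 0.1 * rng.standard_normal(30)
X_new = np.linspace(-3, 3, 100).reshape(-1, 1)

length_scale, noise_var = 1.0, 0.05

# KernelRidge: alpha is the regularization strength lambda; gamma = 1/(2 l^2) matches RBF(l)
krr = KernelRidge(alpha=noise_var, kernel='rbf', gamma=1.0 / (2 * length_scale**2)).fit(X, y)

# GaussianProcessRegressor: alpha adds the noise variance to the kernel diagonal;
# optimizer=None keeps hyperparameters fixed so the comparison is like-for-like
gpr = GaussianProcessRegressor(kernel=RBF(length_scale=length_scale),
                               alpha=noise_var, optimizer=None).fit(X, y)

gp_mean, gp_std = gpr.predict(X_new, return_std=True)
print("max |KRR - GP mean|:", np.abs(krr.predict(X_new) - gp_mean).max())  # ~1e-12
print("GP additionally returns a predictive std, e.g.:", gp_std[:3])
```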
Both methods have essentially the same computational bottleneck: solving a linear system involving the $n \times n$ kernel matrix.
Training/Fitting:
| Operation | Kernel Ridge | GP |
|---|---|---|
| Build $K$ | $O(n^2)$ | $O(n^2)$ |
| Factorize $K + \lambda I$ | $O(n^3)$ | $O(n^3)$ |
| Solve for $\alpha$ | $O(n^2)$ | $O(n^2)$ |
| Log marginal likelihood | — | $O(n^3)$ (same as factorization) |
Prediction (at $m$ test points):
| Operation | Kernel Ridge | GP |
|---|---|---|
| Build $K_*$ | $O(mn)$ | $O(mn)$ |
| Mean prediction | $O(mn)$ | $O(mn)$ |
| Variance prediction | — | $O(mn^2)$ additional (one triangular solve per test point) |
Practical Observations:
Both methods hit the same $O(n^3)$ factorization wall as $n$ grows, and for moderate test-set sizes the GP's extra variance computation adds only a modest overhead on top of that shared cost. The benchmark below times both implementations to verify the cubic scaling and measure the overhead.
```python
import numpy as np
import time
import matplotlib.pyplot as plt

def timed_kernel_ridge(X_train, y_train, X_test, lmbda=0.1):
    """Time Kernel Ridge prediction."""
    start = time.time()
    # Build kernel matrix
    sq_dist_train = np.sum(X_train**2, 1).reshape(-1,1) + np.sum(X_train**2, 1) - 2*X_train@X_train.T
    K = np.exp(-sq_dist_train / 2)
    # Solve system
    alpha = np.linalg.solve(K + lmbda * np.eye(len(K)), y_train)
    # Predict
    sq_dist_test = np.sum(X_test**2, 1).reshape(-1,1) + np.sum(X_train**2, 1) - 2*X_test@X_train.T
    K_star = np.exp(-sq_dist_test / 2)
    pred = K_star @ alpha
    return time.time() - start, pred

def timed_gp(X_train, y_train, X_test, noise_var=0.1):
    """Time GP prediction with variance."""
    start = time.time()
    # Build kernel matrix
    sq_dist_train = np.sum(X_train**2, 1).reshape(-1,1) + np.sum(X_train**2, 1) - 2*X_train@X_train.T
    K = np.exp(-sq_dist_train / 2)
    # Cholesky
    L = np.linalg.cholesky(K + noise_var * np.eye(len(K)))
    # Solve system
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    # Predict mean
    sq_dist_test = np.sum(X_test**2, 1).reshape(-1,1) + np.sum(X_train**2, 1) - 2*X_test@X_train.T
    K_star = np.exp(-sq_dist_test / 2)
    mean = K_star @ alpha
    # Predict variance (extra step)
    v = np.linalg.solve(L, K_star.T)
    var = 1.0 - np.sum(v**2, axis=0)
    return time.time() - start, mean, var

# Benchmark across sizes
train_sizes = [100, 200, 500, 1000, 2000]
n_test = 100

krr_times = []
gp_times = []

for n in train_sizes:
    print(f"Testing n = {n}...")
    X_train = np.random.randn(n, 5)  # 5D input
    y_train = np.sum(np.sin(X_train), axis=1)
    X_test = np.random.randn(n_test, 5)

    # Multiple runs for stability
    krr_t, gp_t = 0, 0
    n_runs = 3
    for _ in range(n_runs):
        t, _ = timed_kernel_ridge(X_train, y_train, X_test)
        krr_t += t
        t, _, _ = timed_gp(X_train, y_train, X_test)
        gp_t += t

    krr_times.append(krr_t / n_runs)
    gp_times.append(gp_t / n_runs)

# Plot timing comparison
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

ax = axes[0]
ax.plot(train_sizes, krr_times, 'o-', linewidth=2, markersize=8, label='Kernel Ridge')
ax.plot(train_sizes, gp_times, 's-', linewidth=2, markersize=8, label='GP (with variance)')
ax.set_xlabel('Training Set Size n')
ax.set_ylabel('Time (seconds)')
ax.set_title('Computation Time Comparison')
ax.legend()
ax.grid(True, alpha=0.3)

# Log-log plot to show cubic scaling
ax = axes[1]
ax.loglog(train_sizes, krr_times, 'o-', linewidth=2, markersize=8, label='Kernel Ridge')
ax.loglog(train_sizes, gp_times, 's-', linewidth=2, markersize=8, label='GP (with variance)')

# Reference cubic scaling
n_ref = np.array(train_sizes)
cubic_ref = (n_ref / n_ref[0])**3 * krr_times[0]
ax.loglog(train_sizes, cubic_ref, 'k--', linewidth=1.5, label='O(n³) reference', alpha=0.7)

ax.set_xlabel('Training Set Size n')
ax.set_ylabel('Time (seconds)')
ax.set_title('Log-Log Scale (Verifying O(n³) Scaling)')
ax.legend()
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\nTiming Summary:")
print(f"{'n':>6} | {'KRR (s)':>10} | {'GP (s)':>10} | {'Overhead':>10}")
print("-" * 45)
for n, kt, gt in zip(train_sizes, krr_times, gp_times):
    overhead = (gt - kt) / kt * 100 if kt > 0 else 0
    print(f"{n:>6} | {kt:>10.4f} | {gt:>10.4f} | {overhead:>9.1f}%")
```

Here's a practical decision framework for choosing between Kernel Ridge and GPs:
| Scenario | Recommended | Reason |
|---|---|---|
| Need uncertainty estimates | GP | Native posterior variance; calibrated confidence intervals |
| Bayesian optimization / Active learning | GP | Acquisition functions require uncertainty |
| Just need point predictions | Either | Identical mean; Kernel Ridge slightly simpler |
| Non-Gaussian likelihood | Kernel SVM/other | GP classification needs approximations |
| Very large datasets (n > 10,000) | Approximate methods | Both need scalable variants (SVGP, random features) |
| Hyperparameter selection matters | GP | Marginal likelihood more principled than CV |
| Team unfamiliar with Bayesian methods | Kernel Ridge | Simpler conceptual model; easier to explain |
| Need to sample plausible functions | GP | Natural posterior sampling |
If you're unsure, start with a GP. You get everything Kernel Ridge offers (the mean prediction) plus uncertainty quantification for free. The computational overhead is minimal, and uncertainty often proves valuable even when you didn't expect it. Modern libraries (GPyTorch, GPflow, scikit-learn) make GPs easy to use.
The worked example below uses the marginal likelihood to choose between two length scales, something that would require cross-validation with Kernel Ridge:

```python
import numpy as np
import matplotlib.pyplot as plt

# Scenario: Model selection
# Goal: Choose between two kernel configurations

np.random.seed(42)

# True function (unknown)
def true_func(x):
    return np.sin(2*x) + 0.5*np.cos(5*x)

# Generate data
n = 20
X_train = np.random.uniform(0, 4, n).reshape(-1, 1)
y_train = true_func(X_train.ravel()) + 0.2 * np.random.randn(n)
X_test = np.linspace(0, 4, 200).reshape(-1, 1)

def gp_with_lml(X_train, y_train, X_test, length_scale=1.0, noise_var=0.1):
    """GP prediction with log marginal likelihood."""
    X_train = np.atleast_2d(X_train)
    X_test = np.atleast_2d(X_test)
    sq_dist = np.sum(X_train**2, 1).reshape(-1,1) + np.sum(X_train**2, 1) - 2*X_train@X_train.T
    K = np.exp(-sq_dist / (2*length_scale**2))
    K_noisy = K + noise_var * np.eye(len(K))
    L = np.linalg.cholesky(K_noisy)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))

    # Mean
    sq_dist_test = np.sum(X_test**2, 1).reshape(-1,1) + np.sum(X_train**2, 1) - 2*X_test@X_train.T
    K_star = np.exp(-sq_dist_test / (2*length_scale**2))
    mean = K_star @ alpha

    # Variance
    v = np.linalg.solve(L, K_star.T)
    var = 1.0 - np.sum(v**2, axis=0)
    std = np.sqrt(np.maximum(var, 0))

    # Log marginal likelihood
    n = len(y_train)
    lml = -0.5 * y_train @ alpha - np.sum(np.log(np.diag(L))) - 0.5*n*np.log(2*np.pi)

    return mean, std, lml

# Compare two models
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Model 1: Short length scale
mean1, std1, lml1 = gp_with_lml(X_train, y_train, X_test, length_scale=0.3)
ax = axes[0]
ax.fill_between(X_test.ravel(), mean1-2*std1, mean1+2*std1, alpha=0.2)
ax.plot(X_test, mean1, linewidth=2)
ax.scatter(X_train, y_train, c='red', s=60, zorder=5)
ax.set_title(f'Model 1: ℓ=0.3\nLog ML = {lml1:.2f}')
ax.set_xlabel('x')
ax.set_ylabel('f(x)')
ax.grid(True, alpha=0.3)

# Model 2: Long length scale
mean2, std2, lml2 = gp_with_lml(X_train, y_train, X_test, length_scale=2.0)
ax = axes[1]
ax.fill_between(X_test.ravel(), mean2-2*std2, mean2+2*std2, alpha=0.2)
ax.plot(X_test, mean2, linewidth=2)
ax.scatter(X_train, y_train, c='red', s=60, zorder=5)
ax.set_title(f'Model 2: ℓ=2.0\nLog ML = {lml2:.2f}')
ax.set_xlabel('x')
ax.set_ylabel('f(x)')
ax.grid(True, alpha=0.3)

# Model selection via marginal likelihood
ax = axes[2]
length_scales = np.logspace(-1, 1, 50)
lmls = [gp_with_lml(X_train, y_train, X_test, length_scale=ls)[2] for ls in length_scales]
ax.plot(length_scales, lmls, 'C0-', linewidth=2)
best_ls = length_scales[np.argmax(lmls)]
ax.axvline(best_ls, color='red', linestyle='--', label=f'Optimal ℓ={best_ls:.2f}')
ax.set_xscale('log')
ax.set_title('GP Model Selection via Marginal Likelihood')
ax.set_xlabel('Length Scale')
ax.set_ylabel('Log Marginal Likelihood')
ax.legend()
ax.grid(True, alpha=0.3)

plt.suptitle('GPs Enable Principled Model Selection', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

print(f"Model 1 (ℓ=0.3) Log ML: {lml1:.2f}")
print(f"Model 2 (ℓ=2.0) Log ML: {lml2:.2f}")
print(f"\nMarginal likelihood selects: ℓ = {best_ls:.2f}")
print("\nWith Kernel Ridge, you'd need cross-validation to tune ℓ!")
```

We've explored the deep connection between Gaussian Processes and Kernel Ridge Regression, understanding when and why to choose each approach.
Module Complete:
Congratulations! You've completed the introduction to Gaussian Processes. You now understand how a GP places a prior over functions, how conditioning on data yields a posterior mean and variance, how the kernel and noise variance shape predictions, and how GP regression connects to Kernel Ridge Regression.
The remaining modules in this chapter will cover advanced topics: kernel design, hyperparameter optimization, computational scaling, and specialized applications.
You now have a solid foundation in Gaussian Process regression. The Bayesian perspective you've learned—treating functions as random variables and updating beliefs with data—is fundamental to modern probabilistic machine learning. GPs are among the most elegant methods for combining flexibility with principled uncertainty quantification.