We've derived the dual formulation and studied the kernel matrix. Now comes the moment of truth: How do we use the trained model to make predictions on new, unseen data?
In kernel ridge regression, prediction takes a remarkably elegant form. Rather than explicitly computing weights in a high-dimensional feature space, we express predictions as weighted combinations of kernel evaluations between the test point and training points. This not only avoids the computational burden of feature space but also provides rich geometric and probabilistic interpretations.
This page explores prediction in depth—from the basic formula to sophisticated interpretations that connect kernel ridge regression to smoothing, interpolation, and Gaussian processes.
By the end of this page, you will:
• Master the kernel prediction formula and understand each component
• Visualize predictions geometrically in feature space
• Interpret predictions as weighted averages with kernel weights
• Understand the smoother matrix (hat matrix) and its properties
• Connect kernel predictions to local regression and smoothing
• Explore confidence and influence of training points
• Handle multiple test points efficiently
Given a new test point $\mathbf{x}_*$, the kernel ridge regression prediction is:
$$f(\mathbf{x}_*) = \mathbf{w}^{*\top} \phi(\mathbf{x}_*)$$
Using the representer theorem ($\mathbf{w}^* = \sum_{i=1}^n \alpha_i^* \phi(\mathbf{x}_i)$), this becomes:
$$f(\mathbf{x}_*) = \sum_{i=1}^n \alpha_i^* k(\mathbf{x}_i, \mathbf{x}_*) = \mathbf{k}_*^\top \boldsymbol{\alpha}^*$$
where:
• $\mathbf{k}_* = (k(\mathbf{x}_1, \mathbf{x}_*), \ldots, k(\mathbf{x}_n, \mathbf{x}_*))^\top$ is the vector of kernel evaluations
• $\boldsymbol{\alpha}^* = (\mathbf{K} + \lambda\mathbf{I})^{-1}\mathbf{y}$ are the trained dual coefficients
Substituting the solution for $\boldsymbol{\alpha}^*$:
$$f(\mathbf{x}_*) = \mathbf{k}_*^\top (\mathbf{K} + \lambda\mathbf{I})^{-1} \mathbf{y}$$
Breaking Down the Components:
Kernel vector $\mathbf{k}_*$: Measures similarity between $\mathbf{x}_*$ and each training point. Points with high similarity to $\mathbf{x}_*$ will have larger kernel values.
Regularized inverse $(\mathbf{K} + \lambda\mathbf{I})^{-1}$: Transforms the kernel vector, down-weighting training points that carry redundant (highly correlated) information and shrinking all contributions according to the regularization strength $\lambda$
Target vector $\mathbf{y}$: The observed training targets that the weighted combination acts on
Computational Flow:
1. Training (once): Compute K, solve (K + λI)α = y for α
2. Prediction (per x*): Compute k*, return k*ᵀα
After training, predictions require only $O(n)$ kernel evaluations and an $O(n)$ dot product per test point.
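This two-step flow is short in code. Below is a minimal NumPy sketch on synthetic 1-D data with an RBF kernel; every name here (the dataset, `rbf_kernel`, the bandwidth, `lam`) is illustrative rather than taken from any particular library.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """Pairwise RBF kernel: k(a, b) = exp(-gamma * ||a - b||^2)."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq_dists)

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(50, 1))                    # training inputs
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(50)     # noisy targets
lam = 0.1                                               # regularization strength

# Training (once): build K and solve (K + lam*I) alpha = y
K = rbf_kernel(X, X)
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)

# Prediction (per test point): build k_* and take a dot product
x_star = np.array([[0.5]])
k_star = rbf_kernel(X, x_star)[:, 0]                    # k(x_i, x_*) for each i
f_star = k_star @ alpha
print(f_star)
```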
Define the prediction weights $\mathbf{w}_{\text{pred}}(\mathbf{x}_*) = (\mathbf{K} + \lambda\mathbf{I})^{-1}\mathbf{k}_*$. Then:
$$f(\mathbf{x}_*) = \sum_{i=1}^n w_i(\mathbf{x}_*) \cdot y_i = \mathbf{w}_{\text{pred}}(\mathbf{x}_*)^\top \mathbf{y}$$
This reveals prediction as a weighted average of training targets, where the weights depend on the test point x*. Different test points have different weight distributions over training points.
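Continuing the sketch above, the same prediction can be recovered as a weighted average of the training targets, with a weight vector that depends on the test point:

```python
# Prediction weights for x_*: w_pred = (K + lam*I)^{-1} k_*
w_pred = np.linalg.solve(K + lam * np.eye(len(X)), k_star)
assert np.isclose(w_pred @ y, f_star)   # same prediction, expressed as sum_i w_i * y_i
print(w_pred.sum())                     # unlike Nadaraya-Watson, the weights need not sum to 1
```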
Understanding predictions geometrically deepens our intuition about what kernel ridge regression computes.
The Feature Space View:
Recall that the kernel corresponds to an inner product in feature space: $$k(\mathbf{x}, \mathbf{x}') = \langle \phi(\mathbf{x}), \phi(\mathbf{x}') \rangle$$
The prediction formula becomes: $$f(\mathbf{x}_*) = \langle \mathbf{w}^*, \phi(\mathbf{x}_*) \rangle$$
This is a linear function in feature space—the inner product of the learned weight vector with the test point's feature representation.
Interpretation:
Even when $\mathcal{H}$ is infinite-dimensional, this inner product is well-defined and computable via the kernel.
Since $\mathbf{w}^* = \sum_i \alpha_i^* \phi(\mathbf{x}_i)$, the model weight vector is a linear combination of training point feature vectors. Geometrically:
• The training points $\phi(\mathbf{x}_1), \ldots, \phi(\mathbf{x}_n)$ span an at most $n$-dimensional subspace of $\mathcal{H}$
• The optimal $\mathbf{w}^*$ lies in this subspace
• Prediction projects $\phi(\mathbf{x}_*)$ onto $\mathbf{w}^*$ within this subspace
Understanding Through the Kernel Trick:
$$f(\mathbf{x}_*) = \langle \mathbf{w}^*, \phi(\mathbf{x}_*) \rangle = \left\langle \sum_{i=1}^n \alpha_i^* \phi(\mathbf{x}_i), \phi(\mathbf{x}_*) \right\rangle$$
$$= \sum_{i=1}^n \alpha_i^* \langle \phi(\mathbf{x}_i), \phi(\mathbf{x}_*) \rangle = \sum_{i=1}^n \alpha_i^* k(\mathbf{x}_i, \mathbf{x}_*)$$
Each training point $i$ contributes $\alpha_i^* k(\mathbf{x}_i, \mathbf{x}_*)$ to the prediction:
Similarity-Weighted Voting:
Think of prediction as each training point "voting" for a value, with each vote weighted by the point's similarity to the test point, $k(\mathbf{x}_i, \mathbf{x}_*)$, and scaled by its learned coefficient $\alpha_i^*$.
When predicting at training points, the predictions take a special form that reveals important structure.
Definition:
The smoother matrix (or hat matrix) $\mathbf{H}$ is defined as: $$\mathbf{H} = \mathbf{K}(\mathbf{K} + \lambda\mathbf{I})^{-1}$$
The vector of predictions at training points is: $$\hat{\mathbf{y}} = \mathbf{K}\boldsymbol{\alpha}^* = \mathbf{K}(\mathbf{K} + \lambda\mathbf{I})^{-1}\mathbf{y} = \mathbf{H}\mathbf{y}$$
The name "hat matrix" comes from putting a "hat" on $\mathbf{y}$ to get predictions: $\hat{\mathbf{y}} = \mathbf{H}\mathbf{y}$.
The hat matrix H = K(K + λI)⁻¹ has remarkable properties:
• It is symmetric, since $\mathbf{K}$ and $(\mathbf{K} + \lambda\mathbf{I})^{-1}$ commute
• Its eigenvalues are the shrinkage factors $\lambda_i(\mathbf{K}) / (\lambda_i(\mathbf{K}) + \lambda) \in [0, 1)$
• Unlike the ordinary least-squares hat matrix, it is not a projection: $\mathbf{H}^2 \neq \mathbf{H}$ for $\lambda > 0$
• Its trace gives the effective degrees of freedom discussed below
Interpretation as Smoothing:
The smoother matrix transforms targets $\mathbf{y}$ into predictions $\hat{\mathbf{y}}$. This transformation:
Shrinks amplitudes: Eigenvalues of $\mathbf{H}$ lie in $[0, 1)$, so $\|\hat{\mathbf{y}}\| \leq \|\mathbf{y}\|$
Filters high-frequency components: Small eigenvalues of $\mathbf{K}$ (corresponding to "noisy" directions) are heavily suppressed by the ratio $\lambda_i / (\lambda_i + \lambda)$
Preserves low-frequency components: Large eigenvalues of $\mathbf{K}$ (corresponding to "smooth" patterns) are nearly preserved
The Degrees of Freedom:
$$\text{df}(\lambda) = \text{trace}(\mathbf{H}) = \sum_{i=1}^n \frac{\lambda_i(\mathbf{K})}{\lambda_i(\mathbf{K}) + \lambda}$$
This measures the effective model complexity: $\text{df}(\lambda) \to n$ as $\lambda \to 0$ (interpolation) and $\text{df}(\lambda) \to 0$ as $\lambda \to \infty$ (the null model), with intermediate values counting how many eigen-directions of $\mathbf{K}$ the fit effectively uses.
The degrees of freedom is crucial for model comparison and cross-validation.
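Both expressions for the degrees of freedom are easy to check numerically; this continues the earlier sketch, reusing `K`, `y`, and `lam`:

```python
# Effective degrees of freedom two ways: eigenvalue shrinkage factors vs trace of H
eigvals = np.linalg.eigvalsh(K)                      # eigenvalues of K
df_eig = np.sum(eigvals / (eigvals + lam))           # sum of shrinkage factors
H = K @ np.linalg.inv(K + lam * np.eye(len(X)))      # smoother (hat) matrix
df_trace = np.trace(H)
print(df_eig, df_trace)                              # agree, and lie between 0 and n
```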
The diagonal element $H_{ii}$ is called the leverage of point $i$. It measures how much the prediction at xᵢ depends on yᵢ itself:
• High leverage (Hᵢᵢ ≈ 1): Prediction at xᵢ ≈ yᵢ (point has high self-influence)
• Low leverage (Hᵢᵢ ≈ 0): Prediction at xᵢ depends mainly on other points
Points with high leverage are potential outliers or points in regions with few neighbors.
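Leverage scores come straight from the diagonal of $\mathbf{H}$. The snippet below reuses `H` from the previous sketch and flags points whose leverage exceeds twice the average, a rule of thumb borrowed from linear regression rather than anything specific to kernels:

```python
y_hat = H @ y                          # fitted values at the training points
leverage = np.diag(H)                  # H_ii, each in [0, 1) for lam > 0
flagged = np.where(leverage > 2 * leverage.mean())[0]
print(flagged)                         # indices with unusually high self-influence
```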
The regularization parameter $\lambda$ controls the fundamental trade-off between fitting the data exactly and producing smooth predictions.
Limit Case 1: Pure Interpolation (λ → 0)
As $\lambda \to 0$: $$\boldsymbol{\alpha}^* \to \mathbf{K}^{-1}\mathbf{y}$$ $$\mathbf{H} \to \mathbf{I}$$
Predictions at training points become $\hat{\mathbf{y}} = \mathbf{y}$—exact interpolation. The model passes exactly through each training point.
Consequences of interpolation: the model fits the noise in the targets exactly, the linear system becomes ill-conditioned as $\lambda \to 0$, and generalization typically suffers.
Pure interpolation with kernels can exhibit Runge-like phenomena: wild oscillations between data points, especially with RBF kernels at small bandwidths. The prediction may look perfect at training points but behave erratically in between. Regularization is essential for stable, generalizable predictions.
Limit Case 2: Maximum Smoothing (λ → ∞)
As $\lambda \to \infty$: $$\boldsymbol{\alpha}^* \to \mathbf{0}$$ $$\mathbf{H} \to \mathbf{0}$$
Predictions become $\hat{\mathbf{y}} \to \mathbf{0}$—the trivial zero function.
Consequences of over-smoothing: the model ignores the structure in the data, training error approaches its maximum, and the fit underfits completely.
The Goldilocks Zone:
Optimal $\lambda$ balances these extremes:
| λ Value | Behavior | Training Error | df | Risk |
|---|---|---|---|---|
| → 0 | Interpolation | Zero | n | Overfitting, instability |
| Small | Flexible fit | Low | High | Possible overfitting |
| Optimal | Balanced | Moderate | Moderate | Best generalization |
| Large | Smooth fit | High | Low | Possible underfitting |
| → ∞ | Null model | Maximum | 0 | Complete underfitting |
Start with λ around the median eigenvalue of K, then tune using cross-validation. If λ is much smaller than the smallest eigenvalue, expect numerical issues. If λ is much larger than the largest eigenvalue, expect severe underfitting.
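A sketch of that recipe, reusing `K` and `y` from the earlier code: the grid is centered on the median eigenvalue, and each candidate is scored with the closed-form leave-one-out residuals $e_i = (y_i - \hat{y}_i)/(1 - H_{ii})$ available for linear smoothers of this form.

```python
eigvals = np.linalg.eigvalsh(K)
lam_grid = np.median(eigvals) * np.logspace(-3, 3, 13)   # grid centered on the median eigenvalue

def loo_error(K, y, lam):
    """Mean squared leave-one-out error via the hat-matrix shortcut."""
    n = len(y)
    H = K @ np.linalg.inv(K + lam * np.eye(n))
    resid = (y - H @ y) / (1.0 - np.diag(H))
    return np.mean(resid ** 2)

scores = [loo_error(K, y, lam) for lam in lam_grid]
best_lam = lam_grid[int(np.argmin(scores))]
print(best_lam)
```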
Kernel ridge regression shares deep connections with local regression methods like LOESS (Local Regression) and Nadaraya-Watson estimation.
The Nadaraya-Watson Estimator:
A classical nonparametric regression estimator: $$f(\mathbf{x}_*) = \frac{\sum_{i=1}^n K_h(\mathbf{x}_*, \mathbf{x}_i)\, y_i}{\sum_{i=1}^n K_h(\mathbf{x}_*, \mathbf{x}_i)}$$
where $K_h$ is a kernel with bandwidth $h$.
Comparison with Kernel Ridge Regression:
Key Difference: Correlation Correction
Nadaraya-Watson uses raw kernel weights: $$w_i^{NW}(\mathbf{x}_*) \propto k(\mathbf{x}_i, \mathbf{x}_*)$$
Kernel ridge regression uses correlation-corrected weights: $$w_i^{KRR}(\mathbf{x}_*) = [(\mathbf{K} + \lambda\mathbf{I})^{-1}\mathbf{k}_*]_i$$
The $(\mathbf{K} + \lambda\mathbf{I})^{-1}$ factor accounts for the fact that training points are correlated. If multiple training points are close together, they shouldn't each get full weight—they're providing redundant information.
Example:
Consider 10 training points clustered near $x = 0$ and 1 isolated point at $x = 5$: Nadaraya-Watson gives each clustered point its full raw kernel weight, so the largely redundant cluster dominates the estimate, whereas kernel ridge regression shares weight across the near-duplicate points.
Kernel ridge regression produces more sensible extrapolation due to this correlation correction.
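A small numerical sketch of this example, reusing the RBF kernel from earlier (the bandwidth, regularization value, and query point are illustrative choices):

```python
X_c = np.vstack([rng.normal(0.0, 0.1, size=(10, 1)),    # 10 points clustered near x = 0
                 np.array([[5.0]])])                     # 1 isolated point at x = 5
K_c = rbf_kernel(X_c, X_c)
x_q = np.array([[1.0]])                                  # query point near the cluster
k_q = rbf_kernel(X_c, x_q)[:, 0]

w_nw = k_q / k_q.sum()                                       # Nadaraya-Watson: raw, normalized weights
w_krr = np.linalg.solve(K_c + 0.1 * np.eye(len(X_c)), k_q)   # KRR: correlation-corrected weights
print(np.round(w_nw, 3))    # each clustered point gets full weight (~0.1 apiece)
print(np.round(w_krr, 3))   # redundant cluster points share a much smaller total weight
```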
LOESS fits local linear (or quadratic) models at each test point, weighted by kernel values. Kernel ridge regression with a linear kernel reduces to global (regularized) linear regression; with a nonlinear kernel it behaves more like a global version of LOESS, fitting a single smooth function everywhere rather than local approximations that may not connect smoothly.
While kernel ridge regression gives point predictions, we often want confidence intervals or uncertainty estimates. There are several approaches.
1. Frequentist Confidence Intervals:
Under the assumption that $y_i = f(\mathbf{x}_i) + \epsilon_i$ with $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$ i.i.d., the prediction at a new point has variance:
$$\text{Var}(f(\mathbf{x}_*)) = \sigma^2\, \mathbf{k}_*^\top (\mathbf{K} + \lambda\mathbf{I})^{-2} \mathbf{k}_*$$
The noise variance $\sigma^2$ can be estimated from residuals: $$\hat{\sigma}^2 = \frac{\|\mathbf{y} - \hat{\mathbf{y}}\|^2}{n - \text{trace}(\mathbf{H})}$$
A $(1-\alpha)$ confidence interval is approximately: $$f(\mathbf{x}_*) \pm z_{\alpha/2} \cdot \hat{\sigma} \sqrt{\mathbf{k}_*^\top (\mathbf{K} + \lambda\mathbf{I})^{-2} \mathbf{k}_*}$$
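A sketch of this plug-in interval, reusing `K`, `lam`, `y`, `k_star`, and `f_star` from the earlier code and taking $z_{0.025} \approx 1.96$ for a 95% interval:

```python
n = len(y)
H = K @ np.linalg.inv(K + lam * np.eye(n))
sigma2_hat = np.sum((y - H @ y) ** 2) / (n - np.trace(H))   # residual variance estimate

A_inv = np.linalg.inv(K + lam * np.eye(n))
var_f = sigma2_hat * k_star @ A_inv @ A_inv @ k_star        # sigma^2 k_*^T (K + lam I)^{-2} k_*
half_width = 1.96 * np.sqrt(var_f)
print(f_star - half_width, f_star + half_width)
```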
Kernel ridge regression has an exact correspondence with Gaussian Process regression. The KRR prediction equals the GP posterior mean, and the GP posterior variance provides a principled uncertainty estimate:
$$\text{Var}_{\text{GP}}(f(\mathbf{x}_*)) = k(\mathbf{x}_*, \mathbf{x}_*) - \mathbf{k}_*^\top (\mathbf{K} + \sigma^2\mathbf{I})^{-1} \mathbf{k}_*$$
(with λ = σ² as the noise variance). This GP interpretation is covered in Module 3.
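Even ahead of Module 3, the GP variance is cheap to evaluate alongside the KRR prediction. A sketch treating `lam` as the noise variance σ², continuing the earlier code:

```python
prior_var = rbf_kernel(x_star, x_star)[0, 0]    # k(x_*, x_*), equal to 1 for the RBF kernel
var_gp = prior_var - k_star @ np.linalg.solve(K + lam * np.eye(len(y)), k_star)
print(var_gp)   # grows as x_* moves away from the training data
```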
2. Bootstrap Confidence Intervals:
A non-parametric approach: resample the training set with replacement, refit the model on each resample, predict at $\mathbf{x}_*$, and take empirical quantiles of the resulting predictions as the interval.
3. Influence-Based Uncertainty:
Points with low similarity to training data should have higher uncertainty. A simple measure: $$\text{uncertainty}(\mathbf{x}_*) \propto \frac{1}{\sum_i k(\mathbf{x}_i, \mathbf{x}_*)}$$
This is high when $\mathbf{x}_*$ is far from all training points.
4. Ensemble Methods:
Train multiple models with different bootstrap resamples, kernel hyperparameters, or regularization values.
Variance across ensemble predictions estimates uncertainty.
For production systems:
In practice, we often need predictions for many test points $\mathbf{x}_1^*, \ldots, \mathbf{x}_m^*$. Efficiency matters.
Naive Approach:
For each test point independently: compute the kernel vector $\mathbf{k}_*$ ($n$ kernel evaluations), then return $\mathbf{k}_*^\top \boldsymbol{\alpha}^*$ (one length-$n$ dot product).
Cost: $O(mn)$ kernel evaluations + $O(mn)$ for dot products = $O(mn)$ total
Matrix Form:
Define the test kernel matrix $\mathbf{K}_* \in \mathbb{R}^{n \times m}$: $$(\mathbf{K}_*)_{ij} = k(\mathbf{x}_i, \mathbf{x}_j^*)$$
Then all predictions are: $$\mathbf{f}^* = \mathbf{K}_*^\top \boldsymbol{\alpha}^* \in \mathbb{R}^m$$
This is a single matrix-vector product, enabling efficient batch prediction.
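In code, once the test kernel matrix is built, batch prediction is a single matrix-vector product (continuing the earlier sketch):

```python
X_test = np.linspace(-3, 3, 200).reshape(-1, 1)   # m = 200 test points
K_star = rbf_kernel(X, X_test)                    # n x m test kernel matrix
f_test = K_star.T @ alpha                         # all m predictions at once
print(f_test.shape)
```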
Training phase (done once):
• Build K: O(n²) kernel evaluations, O(n²d) time
• Solve system: O(n³) for a direct solve, less for iterative methods
• Store α: O(n) space
Prediction phase (per batch of m points):
• Build K*: O(nm) kernel evaluations, O(nmd) time
• Matrix-vector product: O(nm) operations
• Total: O(nmd) time
Prediction scales linearly in both n and m, making batch prediction efficient.
Optimization Strategies:
1. Precompute and cache $(\mathbf{K} + \lambda\mathbf{I})^{-1}$:
If predictions at training points or leave-one-out evaluations are needed, storing the full inverse (or its Cholesky factor) can save time.
2. Online prediction updates:
For streaming test data, incrementally compute kernel values and predictions without reforming $\mathbf{K}_*$.
3. Approximate nearest neighbor search:
For local kernels (like RBF), only nearby training points contribute significantly. Use KD-trees or ball trees to find the $k$ nearest training points and compute only those kernel values.
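A sketch of this idea using a plain partial sort instead of a tree structure; the cutoff `n_keep` is an illustrative choice, and the result only approximates the exact prediction by dropping negligible kernel values:

```python
def local_predict(x_q, X, alpha, gamma=1.0, n_keep=20):
    """Approximate k_*^T alpha using only the n_keep nearest training points."""
    sq_d = ((X - x_q) ** 2).sum(axis=1)
    idx = np.argpartition(sq_d, n_keep)[:n_keep]   # indices of the nearest points
    return np.exp(-gamma * sq_d[idx]) @ alpha[idx]

print(local_predict(np.array([0.5]), X, alpha))
```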
4. Low-rank approximations:
If $\mathbf{K} \approx \mathbf{U}\mathbf{U}^\top$ with $\mathbf{U} \in \mathbb{R}^{n \times r}$ ($r \ll n$), then: $$\boldsymbol{\alpha}^* \approx (\mathbf{U}\mathbf{U}^\top + \lambda\mathbf{I})^{-1}\mathbf{y}$$
Using Woodbury identity, this reduces to an $r \times r$ system. Prediction becomes $O(r)$ per test point. (Covered in Module 5.)
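A sketch of the Woodbury route. Here the low-rank factor comes from a truncated eigendecomposition of `K` purely for illustration; in practice $\mathbf{U}$ would come from Nyström sampling or random features (Module 5):

```python
r = 10
eigvals, eigvecs = np.linalg.eigh(K)
U = eigvecs[:, -r:] * np.sqrt(np.maximum(eigvals[-r:], 0.0))   # K ~ U @ U.T (top-r eigenpairs)

# Woodbury: (U U^T + lam*I)^{-1} y = (1/lam) * (y - U (lam*I_r + U^T U)^{-1} U^T y)
small = lam * np.eye(r) + U.T @ U                              # only an r x r system to solve
alpha_lr = (y - U @ np.linalg.solve(small, U.T @ y)) / lam
print(np.linalg.norm(alpha_lr - alpha))   # small when K is well approximated at rank r
```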
| Method | Training | Prediction (m points) | Space |
|---|---|---|---|
| Exact KRR | O(n³) | O(nmd) | O(n²) |
| Cholesky stored | O(n³) | O(nmd) | O(n²) |
| Rank-r approx | O(n²r) | O(rmd) | O(nr) |
| Random features | O(nr²) | O(rmd) | O(nr) |
| Nearest neighbor | O(n²) | O(kmd) | O(n) |
We've comprehensively explored how predictions work in kernel ridge regression. Let's consolidate the key insights:
What's Next:
With prediction mechanics understood, we turn to Computational Complexity. We'll analyze:
You now understand how kernel ridge regression makes predictions—from the mathematical formula to geometric intuition to practical considerations. This knowledge is essential for both implementing and interpreting kernel regression models.