We've derived the dual formulation and studied the kernel matrix. Now comes the moment of truth: How do we use the trained model to make predictions on new, unseen data?
In kernel ridge regression, prediction takes a remarkably elegant form. Rather than explicitly computing weights in a high-dimensional feature space, we express predictions as weighted combinations of kernel evaluations between the test point and training points. This not only avoids the computational burden of feature space but also provides rich geometric and probabilistic interpretations.
This page explores prediction in depth—from the basic formula to sophisticated interpretations that connect kernel ridge regression to smoothing, interpolation, and Gaussian processes.
By the end of this page, you will:
• Master the kernel prediction formula and understand each component
• Visualize predictions geometrically in feature space
• Interpret predictions as weighted averages with kernel weights
• Understand the smoother matrix (hat matrix) and its properties
• Connect kernel predictions to local regression and smoothing
• Explore confidence and influence of training points
• Handle multiple test points efficiently
Given a new test point $\mathbf{x}_*$, the kernel ridge regression prediction is:
$$f(\mathbf{x}_*) = \mathbf{w}^{*\top} \phi(\mathbf{x}_*)$$
Using the representer theorem ($\mathbf{w}^* = \sum_{i=1}^n \alpha_i^* \phi(\mathbf{x}_i)$), this becomes:
$$f(\mathbf{x}_*) = \sum_{i=1}^n \alpha_i^* k(\mathbf{x}_i, \mathbf{x}_*) = \mathbf{k}_*^\top \boldsymbol{\alpha}^*$$
where:
• $\mathbf{k}_* = (k(\mathbf{x}_1, \mathbf{x}_*), \ldots, k(\mathbf{x}_n, \mathbf{x}_*))^\top$ is the vector of kernel evaluations
• $\boldsymbol{\alpha}^* = (\mathbf{K} + \lambda\mathbf{I})^{-1}\mathbf{y}$ are the trained dual coefficients
Substituting the solution for $\boldsymbol{\alpha}^*$:
$$f(\mathbf{x}_*) = \mathbf{k}_*^\top (\mathbf{K} + \lambda\mathbf{I})^{-1} \mathbf{y}$$
Breaking Down the Components:
Kernel vector $\mathbf{k}_*$: Measures similarity between $\mathbf{x}_*$ and each training point. Points with high similarity to $\mathbf{x}_*$ will have larger kernel values.
Regularized inverse $(\mathbf{K} + \lambda\mathbf{I})^{-1}$: Transforms the kernel vector, down-weighting training points that carry redundant (highly correlated) information and shrinking all contributions according to the regularization strength $\lambda$
Target vector $\mathbf{y}$: The observed training targets that the weighted combination acts on
Computational Flow:
1. Training (once): Compute K, solve (K + λI)α = y for α
2. Prediction (per x*): Compute k*, return k*ᵀα
After training, predictions require only $O(n)$ kernel evaluations and an $O(n)$ dot product per test point.
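This two-step flow is short in code. Below is a minimal NumPy sketch on synthetic 1-D data with an RBF kernel; every name here (the dataset, `rbf_kernel`, the bandwidth, `lam`) is illustrative rather than taken from any particular library.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """Pairwise RBF kernel: k(a, b) = exp(-gamma * ||a - b||^2)."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq_dists)

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(50, 1))                    # training inputs
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(50)     # noisy targets
lam = 0.1                                               # regularization strength

# Training (once): build K and solve (K + lam*I) alpha = y
K = rbf_kernel(X, X)
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)

# Prediction (per test point): build k_* and take a dot product
x_star = np.array([[0.5]])
k_star = rbf_kernel(X, x_star)[:, 0]                    # k(x_i, x_*) for each i
f_star = k_star @ alpha
print(f_star)
```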
Define the prediction weights $\mathbf{w}_{\text{pred}}(\mathbf{x}_*) = (\mathbf{K} + \lambda\mathbf{I})^{-1}\mathbf{k}_*$. Then:
$$f(\mathbf{x}_*) = \sum_{i=1}^n w_i(\mathbf{x}_*) \cdot y_i = \mathbf{w}_{\text{pred}}(\mathbf{x}_*)^\top \mathbf{y}$$
This reveals prediction as a weighted average of training targets, where the weights depend on the test point x*. Different test points have different weight distributions over training points.
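Continuing the sketch above, the same prediction can be recovered as a weighted average of the training targets, with a weight vector that depends on the test point:

```python
# Prediction weights for x_*: w_pred = (K + lam*I)^{-1} k_*
w_pred = np.linalg.solve(K + lam * np.eye(len(X)), k_star)
assert np.isclose(w_pred @ y, f_star)   # same prediction, expressed as sum_i w_i * y_i
print(w_pred.sum())                     # unlike Nadaraya-Watson, the weights need not sum to 1
```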
Understanding predictions geometrically deepens our intuition about what kernel ridge regression computes.
The Feature Space View:
Recall that the kernel corresponds to an inner product in feature space: $$k(\mathbf{x}, \mathbf{x}') = \langle \phi(\mathbf{x}), \phi(\mathbf{x}') \rangle$$
The prediction formula becomes: $$f(\mathbf{x}_*) = \langle \mathbf{w}^*, \phi(\mathbf{x}_*) \rangle$$
This is a linear function in feature space—the inner product of the learned weight vector with the test point's feature representation.
Interpretation:
Even when $\mathcal{H}$ is infinite-dimensional, this inner product is well-defined and computable via the kernel.
Since $\mathbf{w}^* = \sum_i \alpha_i^* \phi(\mathbf{x}_i)$, the model weight vector is a linear combination of training point feature vectors. Geometrically:
• The training points $\phi(\mathbf{x}_1), \ldots, \phi(\mathbf{x}_n)$ span an at most $n$-dimensional subspace of $\mathcal{H}$
• The optimal $\mathbf{w}^*$ lies in this subspace
• Prediction projects $\phi(\mathbf{x}_*)$ onto $\mathbf{w}^*$ within this subspace
Understanding Through the Kernel Trick:
$$f(\mathbf{x}_*) = \langle \mathbf{w}^*, \phi(\mathbf{x}_*) \rangle = \left\langle \sum_{i=1}^n \alpha_i^* \phi(\mathbf{x}_i), \phi(\mathbf{x}_*) \right\rangle$$
$$= \sum_{i=1}^n \alpha_i^* \langle \phi(\mathbf{x}_i), \phi(\mathbf{x}_*) \rangle = \sum_{i=1}^n \alpha_i^* k(\mathbf{x}_i, \mathbf{x}_*)$$
Each training point $i$ contributes $\alpha_i^* k(\mathbf{x}_i, \mathbf{x}_*)$ to the prediction:
Similarity-Weighted Voting:
Think of prediction as each training point "voting" for a value, with each vote weighted by the point's similarity to the test point, $k(\mathbf{x}_i, \mathbf{x}_*)$, and scaled by its learned coefficient $\alpha_i^*$.
When predicting at training points, the predictions take a special form that reveals important structure.
Definition:
The smoother matrix (or hat matrix) $\mathbf{H}$ is defined as: $$\mathbf{H} = \mathbf{K}(\mathbf{K} + \lambda\mathbf{I})^{-1}$$
The vector of predictions at training points is: $$\hat{\mathbf{y}} = \mathbf{K}\boldsymbol{\alpha}^* = \mathbf{K}(\mathbf{K} + \lambda\mathbf{I})^{-1}\mathbf{y} = \mathbf{H}\mathbf{y}$$
The name "hat matrix" comes from putting a "hat" on $\mathbf{y}$ to get predictions: $\hat{\mathbf{y}} = \mathbf{H}\mathbf{y}$.
The hat matrix H = K(K + λI)⁻¹ has remarkable properties:
• It is symmetric, since $\mathbf{K}$ and $(\mathbf{K} + \lambda\mathbf{I})^{-1}$ commute
• Its eigenvalues are the shrinkage factors $\lambda_i(\mathbf{K}) / (\lambda_i(\mathbf{K}) + \lambda) \in [0, 1)$
• Unlike the ordinary least-squares hat matrix, it is not a projection: $\mathbf{H}^2 \neq \mathbf{H}$ for $\lambda > 0$
• Its trace gives the effective degrees of freedom discussed below
Interpretation as Smoothing:
The smoother matrix transforms targets $\mathbf{y}$ into predictions $\hat{\mathbf{y}}$. This transformation:
Shrinks amplitudes: Eigenvalues of $\mathbf{H}$ lie in $[0, 1)$, so $\|\hat{\mathbf{y}}\| \leq \|\mathbf{y}\|$
Filters high-frequency components: Small eigenvalues of $\mathbf{K}$ (corresponding to "noisy" directions) are heavily suppressed by the ratio $\lambda_i / (\lambda_i + \lambda)$
Preserves low-frequency components: Large eigenvalues of $\mathbf{K}$ (corresponding to "smooth" patterns) are nearly preserved
The Degrees of Freedom:
$$\text{df}(\lambda) = \text{trace}(\mathbf{H}) = \sum_{i=1}^n \frac{\lambda_i(\mathbf{K})}{\lambda_i(\mathbf{K}) + \lambda}$$
This measures the effective model complexity: $\text{df}(\lambda) \to n$ as $\lambda \to 0$ (interpolation) and $\text{df}(\lambda) \to 0$ as $\lambda \to \infty$ (the null model), with intermediate values counting how many eigen-directions of $\mathbf{K}$ the fit effectively uses.
The degrees of freedom is crucial for model comparison and cross-validation.
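Both expressions for the degrees of freedom are easy to check numerically; this continues the earlier sketch, reusing `K`, `y`, and `lam`:

```python
# Effective degrees of freedom two ways: eigenvalue shrinkage factors vs trace of H
eigvals = np.linalg.eigvalsh(K)                      # eigenvalues of K
df_eig = np.sum(eigvals / (eigvals + lam))           # sum of shrinkage factors
H = K @ np.linalg.inv(K + lam * np.eye(len(X)))      # smoother (hat) matrix
df_trace = np.trace(H)
print(df_eig, df_trace)                              # agree, and lie between 0 and n
```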
The diagonal element $H_{ii}$ is called the leverage of point $i$. It measures how much the prediction at xᵢ depends on yᵢ itself:
• High leverage (Hᵢᵢ ≈ 1): Prediction at xᵢ ≈ yᵢ (point has high self-influence)
• Low leverage (Hᵢᵢ ≈ 0): Prediction at xᵢ depends mainly on other points
Points with high leverage are potential outliers or points in regions with few neighbors.
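Leverage scores come straight from the diagonal of $\mathbf{H}$. The snippet below reuses `H` from the previous sketch and flags points whose leverage exceeds twice the average, a rule of thumb borrowed from linear regression rather than anything specific to kernels:

```python
y_hat = H @ y                          # fitted values at the training points
leverage = np.diag(H)                  # H_ii, each in [0, 1) for lam > 0
flagged = np.where(leverage > 2 * leverage.mean())[0]
print(flagged)                         # indices with unusually high self-influence
```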
The regularization parameter $\lambda$ controls the fundamental trade-off between fitting the data exactly and producing smooth predictions.
Limit Case 1: Pure Interpolation (λ → 0)
As $\lambda \to 0$: $$\boldsymbol{\alpha}^* \to \mathbf{K}^{-1}\mathbf{y}$$ $$\mathbf{H} \to \mathbf{I}$$
Predictions at training points become $\hat{\mathbf{y}} = \mathbf{y}$—exact interpolation. The model passes exactly through each training point.
Consequences of interpolation: the model fits the noise in the targets exactly, the linear system becomes ill-conditioned as $\lambda \to 0$, and generalization typically suffers.
Pure interpolation with kernels can exhibit Runge-like phenomena: wild oscillations between data points, especially with RBF kernels at small bandwidths. The prediction may look perfect at training points but behave erratically in between. Regularization is essential for stable, generalizable predictions.
Limit Case 2: Maximum Smoothing (λ → ∞)
As $\lambda \to \infty$: $$\boldsymbol{\alpha}^* \to \mathbf{0}$$ $$\mathbf{H} \to \mathbf{0}$$
Predictions become $\hat{\mathbf{y}} \to \mathbf{0}$—the trivial zero function.
Consequences of over-smoothing: the model ignores the structure in the data, training error approaches its maximum, and the fit underfits completely.
The Goldilocks Zone:
Optimal $\lambda$ balances these extremes:
| λ Value | Behavior | Training Error | df | Risk |
|---|---|---|---|---|
| → 0 | Interpolation | Zero | n | Overfitting, instability |
| Small | Flexible fit | Low | High | Possible overfitting |
| Optimal | Balanced | Moderate | Moderate | Best generalization |
| Large | Smooth fit | High | Low | Possible underfitting |
| → ∞ | Null model | Maximum | 0 | Complete underfitting |
Start with λ around the median eigenvalue of K, then tune using cross-validation. If λ is much smaller than the smallest eigenvalue, expect numerical issues. If λ is much larger than the largest eigenvalue, expect severe underfitting.
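A sketch of that recipe, reusing `K` and `y` from the earlier code: the grid is centered on the median eigenvalue, and each candidate is scored with the closed-form leave-one-out residuals $e_i = (y_i - \hat{y}_i)/(1 - H_{ii})$ available for linear smoothers of this form.

```python
eigvals = np.linalg.eigvalsh(K)
lam_grid = np.median(eigvals) * np.logspace(-3, 3, 13)   # grid centered on the median eigenvalue

def loo_error(K, y, lam):
    """Mean squared leave-one-out error via the hat-matrix shortcut."""
    n = len(y)
    H = K @ np.linalg.inv(K + lam * np.eye(n))
    resid = (y - H @ y) / (1.0 - np.diag(H))
    return np.mean(resid ** 2)

scores = [loo_error(K, y, lam) for lam in lam_grid]
best_lam = lam_grid[int(np.argmin(scores))]
print(best_lam)
```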
Kernel ridge regression shares deep connections with local regression methods like LOESS (Local Regression) and Nadaraya-Watson estimation.
The Nadaraya-Watson Estimator:
A classical nonparametric regression estimator: $$f(\mathbf{x}_*) = \frac{\sum_{i=1}^n K_h(\mathbf{x}_*, \mathbf{x}_i)\, y_i}{\sum_{i=1}^n K_h(\mathbf{x}_*, \mathbf{x}_i)}$$
where $K_h$ is a kernel with bandwidth $h$.
Comparison with Kernel Ridge Regression:
Key Difference: Correlation Correction
Nadaraya-Watson uses raw kernel weights: $$w_i^{NW}(\mathbf{x}_*) \propto k(\mathbf{x}_i, \mathbf{x}_*)$$
Kernel ridge regression uses correlation-corrected weights: $$w_i^{KRR}(\mathbf{x}_*) = [(\mathbf{K} + \lambda\mathbf{I})^{-1}\mathbf{k}_*]_i$$
The $(\mathbf{K} + \lambda\mathbf{I})^{-1}$ factor accounts for the fact that training points are correlated. If multiple training points are close together, they shouldn't each get full weight—they're providing redundant information.
Example:
Consider 10 training points clustered near $x = 0$ and 1 isolated point at $x = 5$: Nadaraya-Watson gives each clustered point its full raw kernel weight, so the largely redundant cluster dominates the estimate, whereas kernel ridge regression shares weight across the near-duplicate points.
Kernel ridge regression produces more sensible extrapolation due to this correlation correction.
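A small numerical sketch of this example, reusing the RBF kernel from earlier (the bandwidth, regularization value, and query point are illustrative choices):

```python
X_c = np.vstack([rng.normal(0.0, 0.1, size=(10, 1)),    # 10 points clustered near x = 0
                 np.array([[5.0]])])                     # 1 isolated point at x = 5
K_c = rbf_kernel(X_c, X_c)
x_q = np.array([[1.0]])                                  # query point near the cluster
k_q = rbf_kernel(X_c, x_q)[:, 0]

w_nw = k_q / k_q.sum()                                       # Nadaraya-Watson: raw, normalized weights
w_krr = np.linalg.solve(K_c + 0.1 * np.eye(len(X_c)), k_q)   # KRR: correlation-corrected weights
print(np.round(w_nw, 3))    # each clustered point gets full weight (~0.1 apiece)
print(np.round(w_krr, 3))   # redundant cluster points share a much smaller total weight
```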
LOESS fits local linear (or quadratic) models at each test point, weighted by kernel values. Kernel ridge regression with a linear kernel reduces to global (regularized) linear regression; with a nonlinear kernel it behaves more like a global version of LOESS, fitting a single smooth function everywhere rather than local approximations that may not connect smoothly.
While kernel ridge regression gives point predictions, we often want confidence intervals or uncertainty estimates. There are several approaches.
1. Frequentist Confidence Intervals:
Under the assumption that $y_i = f(\mathbf{x}_i) + \epsilon_i$ with $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$ i.i.d., the prediction at a new point has variance:
$$\text{Var}(f(\mathbf{x}_*)) = \sigma^2\, \mathbf{k}_*^\top (\mathbf{K} + \lambda\mathbf{I})^{-2} \mathbf{k}_*$$
The noise variance $\sigma^2$ can be estimated from residuals: $$\hat{\sigma}^2 = \frac{\|\mathbf{y} - \hat{\mathbf{y}}\|^2}{n - \text{trace}(\mathbf{H})}$$
A $(1-\alpha)$ confidence interval is approximately: $$f(\mathbf{x}_*) \pm z_{\alpha/2} \cdot \hat{\sigma} \sqrt{\mathbf{k}_*^\top (\mathbf{K} + \lambda\mathbf{I})^{-2} \mathbf{k}_*}$$
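A sketch of this plug-in interval, reusing `K`, `lam`, `y`, `k_star`, and `f_star` from the earlier code and taking $z_{0.025} \approx 1.96$ for a 95% interval:

```python
n = len(y)
H = K @ np.linalg.inv(K + lam * np.eye(n))
sigma2_hat = np.sum((y - H @ y) ** 2) / (n - np.trace(H))   # residual variance estimate

A_inv = np.linalg.inv(K + lam * np.eye(n))
var_f = sigma2_hat * k_star @ A_inv @ A_inv @ k_star        # sigma^2 k_*^T (K + lam I)^{-2} k_*
half_width = 1.96 * np.sqrt(var_f)
print(f_star - half_width, f_star + half_width)
```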
Kernel ridge regression has an exact correspondence with Gaussian Process regression. The KRR prediction equals the GP posterior mean, and the GP posterior variance provides a principled uncertainty estimate:
$$\text{Var}_{\text{GP}}(f(\mathbf{x}_*)) = k(\mathbf{x}_*, \mathbf{x}_*) - \mathbf{k}_*^\top (\mathbf{K} + \sigma^2\mathbf{I})^{-1} \mathbf{k}_*$$
(with λ = σ² as the noise variance). This GP interpretation is covered in Module 3.
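Even ahead of Module 3, the GP variance is cheap to evaluate alongside the KRR prediction. A sketch treating `lam` as the noise variance σ², continuing the earlier code:

```python
prior_var = rbf_kernel(x_star, x_star)[0, 0]    # k(x_*, x_*), equal to 1 for the RBF kernel
var_gp = prior_var - k_star @ np.linalg.solve(K + lam * np.eye(len(y)), k_star)
print(var_gp)   # grows as x_* moves away from the training data
```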
2. Bootstrap Confidence Intervals:
A non-parametric approach: resample the training set with replacement, refit the model on each resample, predict at $\mathbf{x}_*$, and take empirical quantiles of the resulting predictions as the interval.
3. Influence-Based Uncertainty:
Points with low similarity to training data should have higher uncertainty. A simple measure: $$\text{uncertainty}(\mathbf{x}_*) \propto \frac{1}{\sum_i k(\mathbf{x}_i, \mathbf{x}_*)}$$
This is high when $\mathbf{x}_*$ is far from all training points.
4. Ensemble Methods:
Train multiple models with different bootstrap resamples, kernel hyperparameters, or regularization values.
Variance across ensemble predictions estimates uncertainty.
For production systems:
In practice, we often need predictions for many test points $\mathbf{x}_1^*, \ldots, \mathbf{x}_m^*$. Efficiency matters.
Naive Approach:
For each test point independently: compute the kernel vector $\mathbf{k}_*$ ($n$ kernel evaluations), then return $\mathbf{k}_*^\top \boldsymbol{\alpha}^*$ (one length-$n$ dot product).
Cost: $O(mn)$ kernel evaluations + $O(mn)$ for dot products = $O(mn)$ total
Matrix Form:
Define the test kernel matrix $\mathbf{K}_* \in \mathbb{R}^{n \times m}$: $$(\mathbf{K}_*)_{ij} = k(\mathbf{x}_i, \mathbf{x}_j^*)$$
Then all predictions are: $$\mathbf{f}^* = \mathbf{K}_*^\top \boldsymbol{\alpha}^* \in \mathbb{R}^m$$
This is a single matrix-vector product, enabling efficient batch prediction.
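In code, once the test kernel matrix is built, batch prediction is a single matrix-vector product (continuing the earlier sketch):

```python
X_test = np.linspace(-3, 3, 200).reshape(-1, 1)   # m = 200 test points
K_star = rbf_kernel(X, X_test)                    # n x m test kernel matrix
f_test = K_star.T @ alpha                         # all m predictions at once
print(f_test.shape)
```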
Training phase (done once):
• Build K: O(n²) kernel evaluations, O(n²d) time
• Solve system: O(n³) for a direct solve, less for iterative methods
• Store α: O(n) space
Prediction phase (per batch of m points):
• Build K*: O(nm) kernel evaluations, O(nmd) time
• Matrix-vector product: O(nm) operations
• Total: O(nmd) time
Prediction scales linearly in both n and m, making batch prediction efficient.
Optimization Strategies:
1. Precompute and cache $(\mathbf{K} + \lambda\mathbf{I})^{-1}$:
If predictions at training points or leave-one-out evaluations are needed, storing the full inverse (or its Cholesky factor) can save time.
2. Online prediction updates:
For streaming test data, incrementally compute kernel values and predictions without reforming $\mathbf{K}_*$.
3. Approximate nearest neighbor search:
For local kernels (like RBF), only nearby training points contribute significantly. Use KD-trees or ball trees to find the $k$ nearest training points and compute only those kernel values.
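A sketch of this idea using a plain partial sort instead of a tree structure; the cutoff `n_keep` is an illustrative choice, and the result only approximates the exact prediction by dropping negligible kernel values:

```python
def local_predict(x_q, X, alpha, gamma=1.0, n_keep=20):
    """Approximate k_*^T alpha using only the n_keep nearest training points."""
    sq_d = ((X - x_q) ** 2).sum(axis=1)
    idx = np.argpartition(sq_d, n_keep)[:n_keep]   # indices of the nearest points
    return np.exp(-gamma * sq_d[idx]) @ alpha[idx]

print(local_predict(np.array([0.5]), X, alpha))
```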
4. Low-rank approximations:
If $\mathbf{K} \approx \mathbf{U}\mathbf{U}^\top$ with $\mathbf{U} \in \mathbb{R}^{n \times r}$ ($r \ll n$), then: $$\boldsymbol{\alpha}^* \approx (\mathbf{U}\mathbf{U}^\top + \lambda\mathbf{I})^{-1}\mathbf{y}$$
Using Woodbury identity, this reduces to an $r \times r$ system. Prediction becomes $O(r)$ per test point. (Covered in Module 5.)
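A sketch of the Woodbury route. Here the low-rank factor comes from a truncated eigendecomposition of `K` purely for illustration; in practice $\mathbf{U}$ would come from Nyström sampling or random features (Module 5):

```python
r = 10
eigvals, eigvecs = np.linalg.eigh(K)
U = eigvecs[:, -r:] * np.sqrt(np.maximum(eigvals[-r:], 0.0))   # K ~ U @ U.T (top-r eigenpairs)

# Woodbury: (U U^T + lam*I)^{-1} y = (1/lam) * (y - U (lam*I_r + U^T U)^{-1} U^T y)
small = lam * np.eye(r) + U.T @ U                              # only an r x r system to solve
alpha_lr = (y - U @ np.linalg.solve(small, U.T @ y)) / lam
print(np.linalg.norm(alpha_lr - alpha))   # small when K is well approximated at rank r
```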
| Method | Training | Prediction (m points) | Space |
|---|---|---|---|
| Exact KRR | O(n³) | O(nmd) | O(n²) |
| Cholesky stored | O(n³) | O(nmd) | O(n²) |
| Rank-r approx | O(n²r) | O(rmd) | O(nr) |
| Random features | O(nr²) | O(rmd) | O(nr) |
| Nearest neighbor | O(n²) | O(kmd) | O(n) |
We've comprehensively explored how predictions work in kernel ridge regression. Let's consolidate the key insights:
What's Next:
With prediction mechanics understood, we turn to Computational Complexity. We'll analyze:
You now understand how kernel ridge regression makes predictions—from the mathematical formula to geometric intuition to practical considerations. This knowledge is essential for both implementing and interpreting kernel regression models.