The kernel function is the single most important design choice in kernel ridge regression. It implicitly defines the feature space, determines what functions can be learned, and encodes our assumptions about the underlying structure of the problem.
Choosing the wrong kernel can make learning impossible—no amount of data will help if your kernel cannot represent the true relationship. Choosing the right kernel can make difficult problems tractable and enable surprisingly accurate predictions with limited data.
This page develops the theory and practice of kernel selection: understanding what different kernels encode, matching kernels to problem characteristics, tuning kernel hyperparameters, and developing intuition for when each kernel family is appropriate.
By the end of this page, you will:
• Understand the properties encoded by different kernel families
• Match kernels to problem characteristics (smoothness, periodicity, etc.)
• Tune kernel hyperparameters effectively
• Combine kernels for complex problems
• Develop practical kernel selection strategies
• Recognize common pitfalls and how to avoid them
Every kernel encodes assumptions about the similarity structure of the problem. These assumptions determine what functions can be well-approximated and how data is interpolated.
The Kernel as Prior:
From a probabilistic perspective (Gaussian Processes view), the kernel specifies a prior distribution over functions. The kernel $k(\mathbf{x}, \mathbf{x}')$ gives the prior covariance between function values at $\mathbf{x}$ and $\mathbf{x}'$: $$\text{Cov}(f(\mathbf{x}), f(\mathbf{x}')) = k(\mathbf{x}, \mathbf{x}')$$
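To make the prior-over-functions view concrete, the following sketch (assuming NumPy and a 1-D RBF kernel; the grid and length scale are illustrative choices) draws sample functions from the prior $\mathcal{N}(\mathbf{0}, \mathbf{K})$:

```python
import numpy as np

def rbf_kernel(x1, x2, length_scale=1.0):
    """RBF kernel on 1-D inputs: k(x, x') = exp(-(x - x')^2 / (2 l^2))."""
    sq = (x1[:, None] - x2[None, :]) ** 2
    return np.exp(-sq / (2.0 * length_scale ** 2))

rng = np.random.default_rng(0)
x = np.linspace(0.0, 5.0, 50)
K = rbf_kernel(x, x, length_scale=1.0)

# Draw 3 functions from the prior N(0, K); the jitter keeps Cholesky stable.
L = np.linalg.cholesky(K + 1e-6 * np.eye(len(x)))
samples = L @ rng.standard_normal((len(x), 3))
```

Shrinking the length scale makes the sampled functions wigglier; growing it makes them flatter, which is exactly the prior assumption the kernel encodes.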
Key Properties Encoded:
The kernel implicitly maps inputs to a feature space where linear relationships capture the desired nonlinearity:
• Linear kernel → linear functions in original space
• Polynomial kernel → polynomial functions
• RBF kernel → arbitrarily smooth functions (infinite-dimensional feature space)
Choosing a kernel is choosing what 'patterns' can be detected.
Stationarity:
A kernel is stationary if it depends only on the difference $\mathbf{x} - \mathbf{x}'$: $$k(\mathbf{x}, \mathbf{x}') = k_s(\mathbf{x} - \mathbf{x}')$$
Stationary kernels assume that the smoothness and variability of the function are the same everywhere in input space.
Stationary kernels: RBF, Matérn, periodic
Non-stationary kernels: Linear, polynomial, neural network kernel
Isotropy:
An isotropic kernel depends only on the distance $|\mathbf{x} - \mathbf{x}'|$: $$k(\mathbf{x}, \mathbf{x}') = k_r(|\mathbf{x} - \mathbf{x}'|)$$
Isotropic kernels treat all input directions identically.
Consequence: If some input dimensions are more important, isotropic kernels may perform poorly—consider ARD (Automatic Relevance Determination) variants.
Let's systematically examine the most important kernel families.
1. Linear Kernel
$$k(\mathbf{x}, \mathbf{x}') = \mathbf{x}^\top \mathbf{x}' = \sum_{i=1}^d x_i x_i'$$
Properties:
• The feature map is the identity, so kernel ridge regression with the linear kernel reduces to ordinary ridge regression
• No kernel hyperparameters to tune
• Cheap to compute and a strong baseline for any problem
• Cannot capture nonlinear relationships
Regularized: A common variant adds a constant offset, $k(\mathbf{x}, \mathbf{x}') = \mathbf{x}^\top \mathbf{x}' + c$ with $c \ge 0$, which corresponds to including a bias feature.
2. Polynomial Kernel
$$k(\mathbf{x}, \mathbf{x}') = (\gamma \mathbf{x}^\top \mathbf{x}' + c)^p$$
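For the special case $\gamma = 1$, $c = 1$, $p = 2$ in two dimensions, the feature map can be written out explicitly, which makes the kernel trick easy to verify. A sketch (the vectors are arbitrary examples):

```python
import numpy as np

def poly_kernel(x, z, gamma=1.0, c=1.0, p=2):
    """Polynomial kernel (gamma x.z + c)^p."""
    return (gamma * (x @ z) + c) ** p

def phi(x):
    """Explicit feature map for (x.z + 1)^2 with 2-D inputs."""
    x1, x2 = x
    return np.array([1.0, np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2, np.sqrt(2) * x1 * x2])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
# The kernel value equals an inner product in the 6-dimensional feature space.
assert np.isclose(poly_kernel(x, z), phi(x) @ phi(z))
```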
Properties:
• Feature space consists of all monomials of the inputs up to degree $p$
• Hyperparameters: degree $p$, scale $\gamma$, offset $c$
• The offset $c$ controls the relative weight of lower-order terms
• Low degrees (2–4) are most useful; high degrees are prone to numerical issues
3. RBF (Gaussian) Kernel
$$k(\mathbf{x}, \mathbf{x}') = \exp\left(-\frac{|\mathbf{x} - \mathbf{x}'|^2}{2\ell^2}\right) = \exp(-\gamma|\mathbf{x} - \mathbf{x}'|^2)$$
where $\ell$ is the length scale and $\gamma = 1/(2\ell^2)$.
Properties:
• Infinite-dimensional feature space
• Functions are infinitely differentiable (very smooth)
• Universal approximator (can learn any continuous function given enough data)
• Single hyperparameter: length scale ℓ
• Most common default choice
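A minimal vectorized implementation (assuming NumPy/SciPy; the data are random placeholders) also confirms the $\gamma = 1/(2\ell^2)$ correspondence:

```python
import numpy as np
from scipy.spatial.distance import cdist

def rbf_kernel(X1, X2, length_scale=1.0):
    """k(x, x') = exp(-||x - x'||^2 / (2 l^2)), vectorized over rows."""
    sq_dists = cdist(X1, X2, metric="sqeuclidean")
    return np.exp(-sq_dists / (2.0 * length_scale ** 2))

X = np.random.default_rng(1).normal(size=(5, 3))
ell = 2.0
K = rbf_kernel(X, X, length_scale=ell)

# The same kernel in the gamma parameterization: gamma = 1 / (2 l^2).
gamma = 1.0 / (2.0 * ell ** 2)
K_gamma = np.exp(-gamma * cdist(X, X, metric="sqeuclidean"))
assert np.allclose(K, K_gamma)
```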
Effect of RBF Length Scale:
| ℓ (length scale) | Behavior |
|---|---|
| Very small | Only nearby points similar; functions highly wiggly; risk of overfitting |
| Small | Local fits; captures fine detail |
| Moderate | Balanced smoothness; typical good choice |
| Large | Points far apart still similar; very smooth functions |
| Very large | Approaches constant function; underfitting |
4. Matérn Kernel Family
$$k(\mathbf{x}, \mathbf{x}') = \frac{2^{1-\nu}}{\Gamma(\nu)}\left(\sqrt{2\nu}\frac{r}{\ell}\right)^\nu K_\nu\left(\sqrt{2\nu}\frac{r}{\ell}\right)$$
where $r = |\mathbf{x} - \mathbf{x}'|$, $K_\nu$ is the modified Bessel function of the second kind, and $\nu$ controls smoothness.
Key cases:
• $\nu = 1/2$: exponential kernel $\exp(-r/\ell)$; continuous but not differentiable
• $\nu = 3/2$: once differentiable
• $\nu = 5/2$: twice differentiable
• $\nu \to \infty$: recovers the RBF kernel (infinitely smooth)
Use when: You need to control smoothness explicitly; physical processes often have finite smoothness.
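The half-integer cases have simple closed forms that avoid the Bessel function entirely. A sketch (NumPy only; $r$ may be a scalar or an array of distances):

```python
import numpy as np

def matern(r, length_scale=1.0, nu=1.5):
    """Matérn kernel for the common half-integer smoothness values."""
    s = np.abs(r) / length_scale
    if nu == 0.5:                      # exponential: continuous, not differentiable
        return np.exp(-s)
    if nu == 1.5:                      # once differentiable
        t = np.sqrt(3.0) * s
        return (1.0 + t) * np.exp(-t)
    if nu == 2.5:                      # twice differentiable
        t = np.sqrt(5.0) * s
        return (1.0 + t + t ** 2 / 3.0) * np.exp(-t)
    raise ValueError("only nu in {0.5, 1.5, 2.5} implemented here")
```

As $\nu \to \infty$ the family converges to the RBF kernel, so Matérn can be seen as RBF with a smoothness dial.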
| Kernel | Smoothness | Hyperparameters | Best Use Case |
|---|---|---|---|
| Linear | Linear only | None (or offset) | Linear relationships, baselines |
| Polynomial | Polynomial-smooth | degree, γ, c | Low-degree nonlinearity |
| RBF | Infinitely smooth | length scale ℓ | General nonlinear (default) |
| Matérn-1/2 | Continuous, not diff. | length scale ℓ | Rough functions |
| Matérn-3/2 | Once differentiable | length scale ℓ | Moderately smooth |
| Matérn-5/2 | Twice differentiable | length scale ℓ | Physical processes |
| Periodic | Periodic + smooth | period, length scale | Seasonal data |
| Rational Quadratic | Scale mixture | α, length scale | Multi-scale patterns |
Standard isotropic kernels use a single length scale for all input dimensions. But often different features have different relevance.
ARD Kernels:
Introduce a separate length scale $\ell_j$ for each input dimension $j$:
$$k_{\text{ARD}}(\mathbf{x}, \mathbf{x}') = \sigma^2 \exp\left(-\frac{1}{2}\sum_{j=1}^d \frac{(x_j - x_j')^2}{\ell_j^2}\right)$$
Equivalently, define $\mathbf{\Lambda} = \text{diag}(\ell_1^{-2}, \ldots, \ell_d^{-2})$: $$k_{\text{ARD}}(\mathbf{x}, \mathbf{x}') = \sigma^2 \exp\left(-\frac{1}{2}(\mathbf{x} - \mathbf{x}')^\top \mathbf{\Lambda} (\mathbf{x} - \mathbf{x}')\right)$$
Large ℓⱼ → Dimension j is 'stretched'; points differ little in this dimension → feature j is irrelevant
Small ℓⱼ → Dimension j is 'compressed'; small differences matter → feature j is highly relevant
By optimizing length scales on validation data, ARD automatically discovers which features matter. Dimensions with ℓⱼ → ∞ are effectively ignored.
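A sketch of the ARD-RBF kernel (NumPy; the data and length scales are illustrative), showing that a very large $\ell_j$ makes dimension $j$ effectively invisible:

```python
import numpy as np

def ard_rbf(X1, X2, length_scales, sigma2=1.0):
    """ARD-RBF: one length scale per input dimension."""
    ell = np.asarray(length_scales, dtype=float)
    diff = X1[:, None, :] - X2[None, :, :]        # shape (n1, n2, d)
    sq = np.sum((diff / ell) ** 2, axis=-1)       # per-dimension scaling
    return sigma2 * np.exp(-0.5 * sq)

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 2))

# With l_2 huge, the second feature is effectively ignored:
K_big = ard_rbf(X, X, length_scales=[1.0, 1e6])
K_1d = ard_rbf(X[:, :1], X[:, :1], length_scales=[1.0])
assert np.allclose(K_big, K_1d, atol=1e-6)
```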
Trade-offs of ARD:

Advantages:
• Automatic feature selection: irrelevant dimensions are suppressed without manual engineering
• Learned length scales are interpretable as per-feature relevance
• Often improves accuracy when only a few of many features matter

Disadvantages:
• $d$ length scales instead of one, making optimization harder and non-convex
• More hyperparameters raise the risk of overfitting the tuning objective
• Requires enough data to estimate per-dimension scales reliably

When to Use: moderate dimensionality, a suspicion that some features are irrelevant, and enough data to support the extra hyperparameters.

When to Avoid: very high-dimensional inputs, small datasets, or settings where all features are known to be comparably relevant; a single shared length scale is then more robust.
Complex problems often require combining multiple kernels to capture different aspects of the data. Fortunately, several operations preserve positive definiteness.
Kernel Arithmetic:
1. Sum of Kernels: $$k(\mathbf{x}, \mathbf{x}') = k_1(\mathbf{x}, \mathbf{x}') + k_2(\mathbf{x}, \mathbf{x}')$$
Interpretation: Functions are sums of functions from each component space. Use: When you expect additive effects from different sources.
2. Product of Kernels: $$k(\mathbf{x}, \mathbf{x}') = k_1(\mathbf{x}, \mathbf{x}') \cdot k_2(\mathbf{x}, \mathbf{x}')$$
Interpretation: Feature space is tensor product; functions have multiplicative interaction structure. Use: When effects from different aspects interact multiplicatively.
3. Scaling: $$k(\mathbf{x}, \mathbf{x}') = c \cdot k_1(\mathbf{x}, \mathbf{x}'), \quad c > 0$$
Interpretation: Scales the variance of the function component.
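These closure properties can be checked numerically: sums, elementwise (Hadamard) products, and positive scalings of valid Gram matrices remain positive semidefinite. A sketch with placeholder data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))

def rbf(X, ell):
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2.0 * ell ** 2))

K1 = rbf(X, 0.5)        # RBF Gram matrix
K2 = X @ X.T            # linear-kernel Gram matrix

# Sum, elementwise product (Schur product theorem), and scaling stay PSD.
for K in (K1 + K2, K1 * K2, 3.0 * K1):
    assert np.linalg.eigvalsh(K).min() > -1e-8
```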
For time series with a trend and seasonal pattern:
$$k(t, t') = \underbrace{k_{\text{linear}}(t, t')}_{\text{trend}} + \underbrace{k_{\text{periodic}}(t, t')}_{\text{seasonality}} + \underbrace{k_{\text{RBF}}(t, t')}_{\text{smooth residual}}$$
Each component captures a different aspect:
• Linear: long-term trend
• Periodic: recurring seasonal patterns
• RBF: smooth deviations from trend+season
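A sketch of this composite kernel on a monthly time index (NumPy; the period of 12 and all length scales are illustrative choices, not tuned values):

```python
import numpy as np

def k_linear(t, tp):
    """Trend component: linear kernel on the time index."""
    return np.outer(t, tp)

def k_periodic(t, tp, period=12.0, ell=1.0):
    """Seasonal component with a known period."""
    d = np.abs(t[:, None] - tp[None, :])
    return np.exp(-2.0 * np.sin(np.pi * d / period) ** 2 / ell ** 2)

def k_rbf(t, tp, ell=3.0):
    """Smooth residual component."""
    sq = (t[:, None] - tp[None, :]) ** 2
    return np.exp(-sq / (2.0 * ell ** 2))

t = np.arange(48, dtype=float)   # e.g. 4 years of monthly observations
K = k_linear(t, t) + k_periodic(t, t) + k_rbf(t, t)
```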
Kernels on Multiple Input Types:
When inputs have multiple components (e.g., spatial location $\mathbf{s}$ and time $t$):
Option 1: Independent effects (sum): $$k((\mathbf{s}, t), (\mathbf{s}', t')) = k_s(\mathbf{s}, \mathbf{s}') + k_t(t, t')$$
Option 2: Interacting effects (product): $$k((\mathbf{s}, t), (\mathbf{s}', t')) = k_s(\mathbf{s}, \mathbf{s}') \cdot k_t(t, t')$$
Option 3: Mixed (sum of products): $$k = k_s + k_t + k_s \cdot k_t$$
The Periodic Kernel:
$$k_{\text{periodic}}(x, x') = \sigma^2 \exp\left(-\frac{2\sin^2\left(\pi|x - x'|/p\right)}{\ell^2}\right)$$
where $p$ is the period. This captures exactly periodic functions.
Use: Time series with known periodicity (daily, weekly, yearly cycles).
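The defining property is easy to verify numerically: shifting one argument by a full period leaves the kernel value unchanged. A sketch with arbitrary points:

```python
import numpy as np

def k_periodic(x, xp, period=1.0, ell=1.0):
    """Periodic kernel on scalar inputs (sigma^2 = 1 for simplicity)."""
    return np.exp(-2.0 * np.sin(np.pi * np.abs(x - xp) / period) ** 2 / ell ** 2)

x, xp, p = 0.3, 0.8, 1.0
# k(x, x' + p) == k(x, x'): functions drawn from this kernel repeat exactly.
assert np.isclose(k_periodic(x, xp, period=p), k_periodic(x, xp + p, period=p))
```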
| Operation | Formula | Effect | Typical Use |
|---|---|---|---|
| Sum | k₁ + k₂ | Additive effects | Multiple independent signals |
| Product | k₁ × k₂ | Multiplicative interaction | Modulating effects |
| Scale | c × k | Variance scaling | Weighting components |
| Polynomial | (k + c)ᵖ | Higher-order features | Boosting complexity |
| Compose | k(g(x), g(x')) | Input warping | Non-uniform relevance |
Most kernels have hyperparameters (length scales, degrees, etc.) that must be tuned. The right hyperparameters can make the difference between a useless model and an excellent one.
What We're Tuning:
| Hyperparameter | Typical Range | Effect |
|---|---|---|
| Regularization λ | 10⁻⁶ to 10⁶ | Bias-variance tradeoff |
| RBF length scale ℓ | 0.01 to 100 × data spread | Smoothness |
| Polynomial degree p | 1 to 10 | Nonlinearity complexity |
| Signal variance σ² | 0.1 to 10 × target variance | Amplitude scale |
Tuning Methods:
• Grid search over log-spaced values with $k$-fold cross-validation (robust default)
• Random search when several hyperparameters interact
• Gradient-based maximization of the marginal likelihood (GP view)
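As a concrete example of grid search, here is a sketch using scikit-learn's `KernelRidge` with an RBF kernel on synthetic data (the grids and the sine toy problem are illustrative):

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.uniform(-3.0, 3.0, size=(100, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=100)

# Log-spaced grids over regularization and RBF width (gamma = 1 / (2 l^2)).
param_grid = {"alpha": np.logspace(-4, 1, 6),
              "gamma": np.logspace(-2, 2, 5)}
search = GridSearchCV(KernelRidge(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)
# search.best_params_ now holds the selected alpha and gamma.
```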
The Marginal Likelihood Approach:
In the Gaussian Process view, we can optimize the marginal likelihood: $$\log p(\mathbf{y} | \mathbf{X}, \theta) = -\frac{1}{2}\mathbf{y}^\top (\mathbf{K}_\theta + \sigma^2 \mathbf{I})^{-1}\mathbf{y} - \frac{1}{2}\log|\mathbf{K}_\theta + \sigma^2 \mathbf{I}| - \frac{n}{2}\log 2\pi$$
This balances:
• Data fit: the quadratic term $-\frac{1}{2}\mathbf{y}^\top (\mathbf{K}_\theta + \sigma^2 \mathbf{I})^{-1}\mathbf{y}$ rewards explaining the targets
• Complexity: the log-determinant term $-\frac{1}{2}\log|\mathbf{K}_\theta + \sigma^2 \mathbf{I}|$ penalizes overly flexible kernels

Advantages:
• Differentiable in $\theta$, so gradient-based optimization is possible
• Uses all the data; no validation split is needed

Disadvantages:
• Non-convex in $\theta$; multiple local optima
• Can overfit when the kernel has many hyperparameters
• Requires the probabilistic (GP) interpretation
Practical Recommendation:
Use marginal likelihood for finding a good region, then validate on held-out data to confirm generalization.
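A sketch of that workflow (NumPy only; the noise level and the 1-D sine data are illustrative): compute the log marginal likelihood via a Cholesky factorization, then scan a grid of length scales for a good region.

```python
import numpy as np

def log_marginal_likelihood(K, y, noise=0.1):
    """log p(y | X, theta) for a zero-mean GP with noise variance `noise`."""
    n = len(y)
    L = np.linalg.cholesky(K + noise * np.eye(n))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return (-0.5 * y @ alpha                 # data-fit term
            - np.sum(np.log(np.diag(L)))     # complexity: (1/2) log|K + noise I|
            - 0.5 * n * np.log(2.0 * np.pi))

def rbf(X, ell):
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2.0 * ell ** 2))

rng = np.random.default_rng(0)
X = rng.uniform(-3.0, 3.0, size=(40, 1))
y = np.sin(X[:, 0])

# Coarse scan over the length scale; keep the maximizer, then confirm
# generalization on held-out data.
scales = np.logspace(-1, 1, 20)
best = max(scales, key=lambda s: log_marginal_likelihood(rbf(X, s), y))
```

The Cholesky route avoids forming an explicit inverse and gives the log-determinant for free from the factor's diagonal.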
How do you choose a kernel for a new problem? Here's a practical decision framework based on problem characteristics.
Decision Tree:
Is the relationship likely linear? → Yes: start with the linear kernel (it doubles as a baseline). No: continue.
Is smoothness expected? → Very smooth: RBF. Finite or unknown smoothness: Matérn-3/2 or Matérn-5/2.
Is there periodicity? → Yes: include a periodic component (summed or multiplied with a smooth kernel).
Multiple scales? → Yes: rational quadratic, or a sum of RBF kernels with different length scales.
Variable feature importance? → Yes: use an ARD variant with standardized inputs.
Problem-Specific Recommendations:
| Problem Type | Characteristics | Recommended Kernel |
|---|---|---|
| Spatial (geostatistics) | Smooth, isotropic | Matérn-3/2 or Matérn-5/2 |
| Time series | Trend + seasonality | Linear + Periodic + RBF |
| Image features | High-d, sparse patterns | RBF or Chi-squared |
| Text/NLP | Sparse, high-d | Linear or Polynomial |
| Molecular properties | Structured + smooth | Graph kernels or RBF |
| Financial data | Noise + trends | Polynomial + RBF |
| Sensor data | Known physics | Domain-specific or Matérn |
| General regression | Unknown structure | RBF (default) |
Diagnostic Checks:
After fitting, examine predictions to validate kernel choice:
Residual plot: Are residuals random? Structure suggests missing components.
Smoothness: Are predictions over-smoothed (missed sharp features) or under-smoothed (wiggly interpolation)?
Extrapolation behavior: How do predictions behave outside training range?
Uncertainty calibration: If using GP interpretation, are 95% intervals well-calibrated?
Cross-validation stability: Same kernel should work across different data splits.
Using RBF blindly: It's a good default, but not always optimal. Consider smoothness requirements.
Ignoring scale: Always standardize inputs or use ARD. Raw features on different scales break isotropic kernels.
Too many hyperparameters: Complex kernel combinations can overfit. Start simple.
Ignoring domain knowledge: If you know the physics (e.g., periodic, monotonic), encode it in the kernel.
Not validating on held-out data: Marginal likelihood can overfit; always check generalization.
Here's a systematic workflow for selecting and tuning kernels in practice.
Step 1: Data Preprocessing
Standardize inputs to zero mean and unit variance; handle missing values and obvious outliers before any kernel computation.

Step 2: Baseline Models
Fit ridge regression (linear kernel) and a default RBF kernel to establish reference scores.

Step 3: Kernel Design Based on Domain
Encode known structure (trends, periodicity, expected smoothness) using the combination rules from earlier in this page.

Step 4: Hyperparameter Tuning
Search log-spaced grids for the regularization λ and length scales, via cross-validation or marginal likelihood.

Step 5: Model Validation
Evaluate on held-out data and run the diagnostic checks: residual structure, smoothness, extrapolation, calibration.

Step 6: Iterate
Add or remove kernel components based on the diagnostics; stop when extra complexity no longer improves validation error.
Typical Timeline:
| Stage | Time Investment |
|---|---|
| Preprocessing | 30% |
| Baseline experiments | 20% |
| Kernel design | 10% |
| Hyperparameter tuning | 30% |
| Validation & iteration | 10% |
Start simple, add complexity only when needed. A well-tuned RBF kernel beats a poorly-tuned complex kernel combination. Invest in hyperparameter tuning before kernel complexity.
This page concludes our deep dive into Kernel Ridge Regression. We've covered kernel selection—the art and science of choosing the right kernel for your problem. Let's consolidate the key insights:
Module Recap: Kernel Ridge Regression
Across 5 pages, we've built up kernel ridge regression end to end:
• The dual formulation and when it pays off
• Kernel matrix structure and positive definiteness
• Prediction mechanics and interpolation behavior
• Computational tradeoffs and scaling strategies
• Kernel selection and combination (this page)
You now have both the theoretical foundation and practical skills to apply kernel ridge regression effectively.
Congratulations! You have completed Module 2: Kernel Ridge Regression. You now understand the dual formulation, kernel matrix structure, prediction mechanics, computational tradeoffs, and kernel selection strategies. This foundation prepares you for Gaussian Processes (Module 3) and computational approximations (Module 5).