In the previous page, we derived the dual formulation of kernel ridge regression and saw that the entire algorithm reduces to operations involving a single matrix: the kernel matrix $\mathbf{K}$. This $n \times n$ matrix, also known as the Gram matrix, is the central computational object in kernel methods.
Understanding the kernel matrix deeply—its structure, properties, and behavior—is essential for effective use of kernel ridge regression. The kernel matrix encodes all the information about pairwise similarities between training points, and its properties directly determine the conditioning, effective capacity, and computational cost of the method.
By the end of this page, you will:
• Understand the mathematical definition and construction of the kernel matrix
• Master the crucial property of positive semi-definiteness
• Explore the eigenstructure of K and its interpretation
• Analyze how regularization modifies the spectrum
• Develop intuition for how different kernels produce different matrix structures
• Recognize numerical challenges and their solutions
Given $n$ training points $\mathbf{x}_1, \ldots, \mathbf{x}_n$ and a kernel function $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, the kernel matrix (or Gram matrix) $\mathbf{K} \in \mathbb{R}^{n \times n}$ is defined by:
$$K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)$$
Equivalently, if $\phi: \mathcal{X} \to \mathcal{H}$ is the feature map corresponding to the kernel (so that $k(\mathbf{x}, \mathbf{x}') = \langle \phi(\mathbf{x}), \phi(\mathbf{x}') \rangle_{\mathcal{H}}$), then:
$$\mathbf{K} = \mathbf{\Phi} \mathbf{\Phi}^\top$$
where $\mathbf{\Phi}$ is the $n \times D$ matrix with $\phi(\mathbf{x}_i)^\top$ as the $i$-th row.
The kernel matrix requires $\binom{n}{2} + n = \frac{n(n+1)}{2}$ kernel evaluations (exploiting symmetry $K_{ij} = K_{ji}$). This is $O(n^2)$ evaluations, each costing $O(d)$ for typical kernels on d-dimensional inputs, giving total construction cost $O(n^2 d)$.
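The half-matrix construction can be sketched as follows. This is an illustrative snippet, not a library routine; `gram_matrix` and the inlined RBF kernel are ad-hoc names chosen for the example.

```python
import numpy as np

def gram_matrix(X, kernel):
    """Build the n x n Gram matrix, evaluating the kernel only for
    i <= j and mirroring across the diagonal: n(n+1)/2 evaluations."""
    n = X.shape[0]
    K = np.empty((n, n))
    for i in range(n):
        for j in range(i, n):
            K[i, j] = kernel(X[i], X[j])
            K[j, i] = K[i, j]  # symmetry: K_ji = K_ij
    return K

# RBF kernel with sigma = 1 as a concrete example
rbf = lambda x, xp: np.exp(-np.sum((x - xp) ** 2) / 2.0)

X = np.random.default_rng(0).normal(size=(5, 2))
K = gram_matrix(X, rbf)
assert np.allclose(K, K.T) and np.allclose(np.diag(K), 1.0)
```

In practice one would vectorize the pairwise distances rather than loop, but the loop makes the $n(n+1)/2$ evaluation count explicit.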
Example: Constructing K for Common Kernels
Let's see how the kernel matrix looks for different kernel functions. Consider 4 training points in 2D: $\mathbf{x}_1 = (0, 0)$, $\mathbf{x}_2 = (1, 0)$, $\mathbf{x}_3 = (0, 1)$, $\mathbf{x}_4 = (1, 1)$.
Linear kernel $k(\mathbf{x}, \mathbf{x}') = \mathbf{x}^\top \mathbf{x}'$:
$$\mathbf{K}_{\text{linear}} = \begin{pmatrix} 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 1 \\ 0 & 0 & 1 & 1 \\ 0 & 1 & 1 & 2 \end{pmatrix}$$
Polynomial kernel $k(\mathbf{x}, \mathbf{x}') = (\mathbf{x}^\top \mathbf{x}' + 1)^2$:
$$\mathbf{K}_{\text{poly}} = \begin{pmatrix} 1 & 1 & 1 & 1 \\ 1 & 4 & 1 & 4 \\ 1 & 1 & 4 & 4 \\ 1 & 4 & 4 & 9 \end{pmatrix}$$
RBF kernel $k(\mathbf{x}, \mathbf{x}') = \exp(-|\mathbf{x} - \mathbf{x}'|^2 / (2\sigma^2))$ with $\sigma = 1$:
$$\mathbf{K}_{\text{RBF}} = \begin{pmatrix} 1 & 0.61 & 0.61 & 0.37 \\ 0.61 & 1 & 0.37 & 0.61 \\ 0.61 & 0.37 & 1 & 0.61 \\ 0.37 & 0.61 & 0.61 & 1 \end{pmatrix}$$
(values rounded)
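These three matrices can be reproduced in a few lines (a minimal sketch using vectorized NumPy operations):

```python
import numpy as np

# The four example points
X = np.array([[0, 0], [1, 0], [0, 1], [1, 1]], dtype=float)

K_lin = X @ X.T                      # linear kernel
K_poly = (X @ X.T + 1) ** 2          # inhomogeneous polynomial, degree 2
sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K_rbf = np.exp(-sq / 2.0)            # RBF with sigma = 1

assert K_lin[3, 3] == 2 and K_poly[3, 3] == 9
assert np.allclose(np.round(K_rbf, 2)[0], [1.0, 0.61, 0.61, 0.37])
```

The broadcasting trick `X[:, None, :] - X[None, :, :]` computes all pairwise differences at once, giving the $O(n^2 d)$ construction cost without Python-level loops.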
Each entry $K_{ij}$ measures similarity between points $\mathbf{x}_i$ and $\mathbf{x}_j$ according to the kernel. The diagonal $K_{ii} = k(\mathbf{x}_i, \mathbf{x}_i)$ gives each point's 'self-similarity.' For normalized kernels, diagonals are 1. The off-diagonals reveal the similarity structure of your data: which points are 'close' in feature space.
The most important property of the kernel matrix is that it is positive semi-definite (PSD). This property is not incidental—it is intimately connected to the very definition of what makes a valid kernel.
Definition (Positive Semi-Definite Matrix):
A symmetric matrix $\mathbf{M} \in \mathbb{R}^{n \times n}$ is positive semi-definite if: $$\mathbf{v}^\top \mathbf{M} \mathbf{v} \geq 0 \quad \text{for all } \mathbf{v} \in \mathbb{R}^n$$
Theorem: For any valid kernel $k$ and any set of points $\mathbf{x}_1, \ldots, \mathbf{x}_n$, the kernel matrix $\mathbf{K}$ is positive semi-definite.
Proof:
For any $\mathbf{v} = (v_1, \ldots, v_n)^\top \in \mathbb{R}^n$:
$$\mathbf{v}^\top \mathbf{K} \mathbf{v} = \sum_{i=1}^n \sum_{j=1}^n v_i K_{ij} v_j = \sum_{i=1}^n \sum_{j=1}^n v_i v_j \langle \phi(\mathbf{x}_i), \phi(\mathbf{x}_j) \rangle$$
$$= \left\langle \sum_{i=1}^n v_i \phi(\mathbf{x}_i), \sum_{j=1}^n v_j \phi(\mathbf{x}_j) \right\rangle = \left\| \sum_{i=1}^n v_i \phi(\mathbf{x}_i) \right\|^2 \geq 0$$
The quadratic form equals a squared norm, which is always non-negative. $\square$
Mercer's theorem tells us that a function $k$ is a valid kernel if and only if the kernel matrix $\mathbf{K}$ is positive semi-definite for every possible choice of points $\mathbf{x}_1, \ldots, \mathbf{x}_n$. This gives us an operational test: propose a similarity function, check that all resulting Gram matrices are PSD, and you have a valid kernel.
Why PSD Matters for Kernel Ridge Regression:
Guarantees convexity: The dual objective $\boldsymbol{\alpha}^\top \mathbf{K} \boldsymbol{\alpha}$ is a convex quadratic when $\mathbf{K}$ is PSD. This ensures a unique global minimum.
Enables regularized inversion: The matrix $(\mathbf{K} + \lambda \mathbf{I})$ is guaranteed to be positive definite (strictly) when $\lambda > 0$, because for any $\mathbf{v} \neq \mathbf{0}$: $$\mathbf{v}^\top (\mathbf{K} + \lambda \mathbf{I}) \mathbf{v} = \underbrace{\mathbf{v}^\top \mathbf{K} \mathbf{v}}_{\geq 0} + \lambda \underbrace{\|\mathbf{v}\|^2}_{> 0} > 0$$
Ensures eigenvalues are non-negative: Spectral analysis of $\mathbf{K}$ (covered below) relies on all eigenvalues being $\geq 0$.
Connects to RKHS theory: PSD kernels correspond to inner products in reproducing kernel Hilbert spaces—the theoretical foundation for kernel methods.
Not all similarity measures are valid kernels! For example, the negative of squared Euclidean distance $k(\mathbf{x}, \mathbf{x}') = -\|\mathbf{x} - \mathbf{x}'\|^2$ is NOT a valid kernel (the resulting matrix is not PSD). Using invalid kernels in kernel methods leads to ill-posed optimization problems with potentially no solution or non-unique solutions.
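Both claims can be checked numerically with Mercer's operational test: compute the Gram matrix for a sample of points and inspect its smallest eigenvalue. A sketch with an illustrative `is_psd` helper (the tolerance guards against floating-point round-off):

```python
import numpy as np

def is_psd(M, tol=1e-10):
    """Numerical PSD test: no eigenvalue below -tol (eigvalsh assumes symmetry)."""
    return np.linalg.eigvalsh(M).min() >= -tol

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
sq = np.sum((X[:, None] - X[None, :]) ** 2, axis=-1)

assert is_psd(np.exp(-sq / 2))   # RBF: a valid kernel, so the Gram matrix is PSD
assert not is_psd(-sq)           # negated squared distance: fails the test
```

Note the test is only a necessary check on one point set; validity requires PSD Gram matrices for *every* point set, so a passing check is evidence, not proof.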
The spectral decomposition of $\mathbf{K}$ reveals deep insights about the learning problem. Since $\mathbf{K}$ is symmetric and PSD, it has an eigendecomposition:
$$\mathbf{K} = \mathbf{U} \mathbf{\Lambda} \mathbf{U}^\top = \sum_{i=1}^n \lambda_i \mathbf{u}_i \mathbf{u}_i^\top$$
where $\mathbf{U} = (\mathbf{u}_1, \ldots, \mathbf{u}_n)$ is an orthogonal matrix whose columns are the eigenvectors of $\mathbf{K}$, and $\mathbf{\Lambda} = \mathrm{diag}(\lambda_1, \ldots, \lambda_n)$ holds the eigenvalues, ordered $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_n \geq 0$ (non-negative by positive semi-definiteness).
Interpretation of Eigenvalues:
The eigenvalues of $\mathbf{K}$ measure the "variance" or "importance" of different directions in the data as measured by the kernel. Large eigenvalues correspond to directions along which the data varies significantly; small eigenvalues correspond to directions with little variation.
The eigenvectors of $\mathbf{K}$ are exactly the principal components found by Kernel PCA. The eigenvalue $\lambda_i$ represents the variance captured by the $i$-th principal component in feature space. This reveals that the kernel matrix contains the complete geometric information about the data in feature space.
Rank of the Kernel Matrix:
The rank of $\mathbf{K}$ is at most $\min(n, D)$ where $D = \dim(\mathcal{H})$. This has important implications:
Linear kernel: $\mathbf{K} = \mathbf{X}\mathbf{X}^\top$ has rank $\leq \min(n, d)$ where $d$ is input dimension. For $d < n$, $\mathbf{K}$ is singular.
Polynomial kernels: Have rank bounded by the dimension of the polynomial feature space.
RBF kernel: The corresponding feature space is infinite-dimensional, but practical matrices have "effective rank" determined by the bandwidth $\sigma$ and data spread.
Eigenvalue Decay:
Different kernels produce characteristic eigenvalue decay patterns:
| Kernel | Typical Eigenvalue Decay | Implication |
|---|---|---|
| Linear | Finite rank (≤ d) | At most d non-zero eigenvalues; fast linear solve |
| Polynomial (degree p) | Finite rank (≤ dim of poly space) | Moderate decay; well-conditioned |
| RBF (large σ) | Fast exponential decay | Few significant eigenvalues; smooth functions |
| RBF (small σ) | Slow decay (spectrum near flat) | Many significant eigenvalues; localized functions |
| Matérn (smooth) | Algebraic decay ~λᵢ ∝ i⁻²ᵛ | Controlled smoothness; good numerical properties |
The rate of eigenvalue decay determines the 'effective dimensionality' of your problem. Rapidly decaying eigenvalues mean the data effectively lives in a low-dimensional subspace of feature space. Slowly decaying eigenvalues indicate that many dimensions are important—the problem is genuinely high-dimensional even under the kernel embedding.
The kernel ridge regression solution involves inverting $(\mathbf{K} + \lambda \mathbf{I})$ rather than $\mathbf{K}$ directly. Let's analyze how regularization modifies the eigenstructure.
Eigenvalues of the Regularized Matrix:
If $\mathbf{K} = \mathbf{U} \mathbf{\Lambda} \mathbf{U}^\top$, then: $$\mathbf{K} + \lambda \mathbf{I} = \mathbf{U} (\mathbf{\Lambda} + \lambda \mathbf{I}) \mathbf{U}^\top$$
The eigenvalues become $\lambda_i + \lambda$ (the kernel eigenvalue plus the regularization parameter). The eigenvectors remain unchanged.
Inversion Analysis:
$$(\mathbf{K} + \lambda \mathbf{I})^{-1} = \mathbf{U} (\mathbf{\Lambda} + \lambda \mathbf{I})^{-1} \mathbf{U}^\top = \sum_{i=1}^n \frac{1}{\lambda_i + \lambda} \mathbf{u}_i \mathbf{u}_i^\top$$
The fitted values at the training points can be written as:
$$\hat{\mathbf{f}} = \mathbf{K}(\mathbf{K} + \lambda \mathbf{I})^{-1} \mathbf{y} = \sum_{i=1}^n \frac{\lambda_i}{\lambda_i + \lambda} \, (\mathbf{u}_i^\top \mathbf{y}) \, \mathbf{u}_i$$
The factor $\lambda_i / (\lambda_i + \lambda)$ acts as a filter:
• When $\lambda_i \gg \lambda$: factor ≈ 1 (eigendirection preserved)
• When $\lambda_i \ll \lambda$: factor ≈ 0 (eigendirection suppressed)
• When $\lambda_i \approx \lambda$: factor ≈ 0.5 (transition region)
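The filter view can be verified numerically: applying the per-eigendirection factors reproduces the in-sample smoother $\mathbf{K}(\mathbf{K} + \lambda\mathbf{I})^{-1}\mathbf{y}$ exactly. A minimal sketch with an RBF kernel and synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))
y = rng.normal(size=30)
K = np.exp(-np.sum((X[:, None] - X[None, :]) ** 2, axis=-1) / 2)
lam = 0.1

# Spectral route: filter each eigencomponent by lambda_i / (lambda_i + lam)
w, U = np.linalg.eigh(K)
fitted_spectral = U @ ((w / (w + lam)) * (U.T @ y))

# Direct route: K (K + lam I)^{-1} y
fitted_direct = K @ np.linalg.solve(K + lam * np.eye(30), y)
assert np.allclose(fitted_spectral, fitted_direct)
```

The two routes agree because $\mathbf{K}$ and $(\mathbf{K} + \lambda\mathbf{I})^{-1}$ share eigenvectors, so their product filters each eigencomponent by $\lambda_i / (\lambda_i + \lambda)$.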
Interpretation: Regularization as Spectral Filtering
Regularization doesn't just "shrink" the solution—it performs spectral filtering. Directions in feature space with large eigenvalues (where data has high variance) are preserved, while directions with small eigenvalues (where data has low variance or where we're extrapolating) are suppressed.
This has profound implications:
Noise reduction: Noise often affects small-eigenvalue directions (where signal-to-noise ratio is low). Suppressing these directions reduces overfitting to noise.
Stability: Even if some $\lambda_i \approx 0$ (near-singularity), the regularized eigenvalue $\lambda_i + \lambda$ remains bounded away from zero.
Implicit dimensionality reduction: Large $\lambda$ effectively reduces the problem to the subspace spanned by eigenvectors with $\lambda_i \gg \lambda$.
Smooth transition: Unlike hard thresholding, the filter $\lambda_i / (\lambda_i + \lambda)$ provides a smooth transition, avoiding discontinuities.
The effective degrees of freedom in kernel ridge regression is:
$$\text{df}(\lambda) = \text{trace}\left(\mathbf{K}(\mathbf{K} + \lambda\mathbf{I})^{-1}\right) = \sum_{i=1}^n \frac{\lambda_i}{\lambda_i + \lambda}$$
This decreases monotonically from $\text{rank}(\mathbf{K})$, which is at most $n$, as $\lambda \to 0$, down to 0 as $\lambda \to \infty$, quantifying how much the model is constrained by regularization.
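The trace formula makes $\text{df}(\lambda)$ cheap to evaluate from a single eigendecomposition, as in this sketch (eigenvalues are clipped at zero to absorb floating-point round-off):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 2))
K = np.exp(-np.sum((X[:, None] - X[None, :]) ** 2, axis=-1) / 2)
eig = np.clip(np.linalg.eigvalsh(K), 0.0, None)  # clip round-off negatives

def dof(lam):
    # df(lambda) = trace(K (K + lam I)^{-1}) = sum_i eig_i / (eig_i + lam)
    return float(np.sum(eig / (eig + lam)))

# Degrees of freedom shrink monotonically as regularization grows
assert dof(1e-9) > dof(1.0) > dof(1e6)
```

Sweeping `lam` over a log grid and plotting `dof(lam)` is a quick way to see how much effective capacity a given regularization setting leaves.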
Different kernels produce kernel matrices with characteristic structures. Understanding these structures provides intuition for how each kernel behaves.
1. Linear Kernel: $k(\mathbf{x}, \mathbf{x}') = \mathbf{x}^\top \mathbf{x}'$
$$\mathbf{K} = \mathbf{X}\mathbf{X}^\top$$
where $\mathbf{X} \in \mathbb{R}^{n \times d}$ is the data matrix. This is the Gram matrix of the raw inputs; its $d \times d$ counterpart $\mathbf{X}^\top\mathbf{X}$ is the uncentered sample covariance. The matrix has rank at most $d$, so for high-dimensional data ($d \geq n$), it can be full rank, while for low-dimensional data ($d < n$), it is necessarily singular.
2. Polynomial Kernel: $k(\mathbf{x}, \mathbf{x}') = (\mathbf{x}^\top \mathbf{x}' + c)^p$
The polynomial kernel adds higher-order interaction terms. The constant $c$ controls the balance between lower and higher-order features. When $c = 0$ (homogeneous polynomial), only degree-$p$ terms appear. When $c > 0$ (inhomogeneous), all degrees up to $p$ are included.
3. RBF (Gaussian) Kernel: $k(\mathbf{x}, \mathbf{x}') = \exp(-|\mathbf{x} - \mathbf{x}'|^2 / 2\sigma^2)$
The RBF kernel matrix has special structure:
• Unit diagonal: $K_{ii} = 1$ for all $i$ (points are maximally similar to themselves)
• Locality: $K_{ij}$ decays exponentially with squared distance—far points contribute little
• Bandwidth sensitivity: Small σ → near-identity, sparse-like matrix; large σ → dense matrix approaching constant
• Always full rank: For distinct points, RBF kernel matrices are strictly positive definite (all eigenvalues > 0)
4. Laplacian Kernel: $k(\mathbf{x}, \mathbf{x}') = \exp(-|\mathbf{x} - \mathbf{x}'|_1 / \sigma)$
Similar to RBF but uses L1 distance. Produces less smooth functions; the corresponding RKHS contains less regular functions.
5. Custom/Composite Kernels:
Kernels can be combined: the sum $k_1 + k_2$, the product $k_1 k_2$, and the positive scaling $c \, k$ (with $c > 0$) of valid kernels are again valid kernels.
These operations preserve positive semi-definiteness, allowing construction of sophisticated, domain-specific kernels.
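Closure under these operations can be spot-checked on the Gram matrices themselves; note that the product of kernel *functions* corresponds to the elementwise (Schur) product of their Gram matrices. A small sketch:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(25, 3))
K_lin = X @ X.T
K_rbf = np.exp(-np.sum((X[:, None] - X[None, :]) ** 2, axis=-1) / 2)

min_eig = lambda M: np.linalg.eigvalsh(M).min()
tol = -1e-8  # tolerance for floating-point round-off
assert min_eig(K_lin + K_rbf) > tol    # sum of kernels is a kernel
assert min_eig(K_lin * K_rbf) > tol    # elementwise (Schur) product stays PSD
assert min_eig(3.0 * K_rbf) > tol      # positive scaling stays PSD
```

The product case is the Schur product theorem: the Hadamard product of two PSD matrices is PSD.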
| Property | Linear | Polynomial | RBF |
|---|---|---|---|
| Rank | min(n, d) | min(n, poly_dim) | n (typically) |
| Diagonal | ||xᵢ||² | (||xᵢ||² + c)ᵖ | 1 |
| Sparsity pattern | Dense | Dense | Bandwidth-dependent |
| Eigenvalue decay | Finite | Finite | Exponential |
| Condition number | Can be poor | Moderate | Bandwidth-dependent |
| Typical use | Linear models | Low-degree nonlinearity | General nonlinear |
Working with kernel matrices in practice requires attention to numerical issues. Poor conditioning can lead to inaccurate solutions or even complete failure.
Condition Number:
The condition number of $(\mathbf{K} + \lambda\mathbf{I})$ is: $$\kappa(\mathbf{K} + \lambda\mathbf{I}) = \frac{\lambda_1 + \lambda}{\lambda_n + \lambda}$$
where $\lambda_1$ and $\lambda_n$ are the largest and smallest eigenvalues of $\mathbf{K}$.
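The formula shows directly that adding $\lambda$ to both the largest and smallest eigenvalues shrinks their ratio. A short numerical sketch of this effect:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(40, 2))
K = np.exp(-np.sum((X[:, None] - X[None, :]) ** 2, axis=-1) / 2)
eig = np.clip(np.linalg.eigvalsh(K), 0.0, None)  # clip round-off negatives

def cond(lam):
    # kappa(K + lam I) = (lambda_1 + lam) / (lambda_n + lam)
    return (eig.max() + lam) / (eig.min() + lam)

# Heavier regularization always improves conditioning
assert cond(1e-2) < cond(1e-6) < cond(1e-12)
```

This is one reason a small "jitter" term is often added to kernel matrices even when regularization is not strictly needed for the statistical model.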
Practical Guidelines:
RBF kernels with a very large σ produce kernel matrices with extremely fast eigenvalue decay: the matrix approaches a constant, nearly rank-one matrix and becomes numerically singular even when theoretically of full rank. To diagnose: if K is 'nearly identity' (off-diagonals ≈ 0), your σ is too small. If K is 'nearly constant' (all entries ≈ 1), your σ is too large relative to data spread.
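The two failure modes of the bandwidth are easy to see numerically. This sketch uses the median off-diagonal entry as an ad-hoc diagnostic (not a standard statistic, just a convenient summary):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(30, 2))
sq = np.sum((X[:, None] - X[None, :]) ** 2, axis=-1)

def rbf_gram(sigma):
    return np.exp(-sq / (2 * sigma ** 2))

# Median off-diagonal entry as a quick bandwidth sanity check
med_off = lambda K: np.median(K[~np.eye(30, dtype=bool)])
assert med_off(rbf_gram(0.01)) < 1e-6    # near-identity: sigma too small
assert med_off(rbf_gram(100.0)) > 0.999  # near-constant: sigma too large
```

A common starting heuristic is to set σ near the median pairwise distance of the data, which keeps the typical off-diagonal entry well away from both extremes.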
Storage Considerations:
The kernel matrix requires $O(n^2)$ storage, which becomes prohibitive for large datasets:
| n | Storage (8-byte doubles) | Typical RAM |
|---|---|---|
| 1,000 | 8 MB | Trivial |
| 10,000 | 800 MB | Manageable |
| 100,000 | 80 GB | Challenging |
| 1,000,000 | 8 TB | Impractical |
For large-scale problems, approximation methods (covered in Module 5) become essential. The full kernel matrix simply cannot be stored or manipulated.
Let's connect the kernel matrix's mathematical properties to its role in learning.
Information Content:
The kernel matrix $\mathbf{K}$ captures all the information from the training data that is relevant for kernel ridge regression. Once $\mathbf{K}$ is computed, the original data $\mathbf{x}_i$ are not needed for solving the optimization—only for computing kernel evaluations at test time.
One practical consequence: $\mathbf{K}$ can be computed once and reused while tuning $\lambda$, since only the $(\mathbf{K} + \lambda\mathbf{I})$ solve changes between settings.
Relationship to Training:
The dual coefficients can be written as:
$$\boldsymbol{\alpha}^* = \mathbf{U}(\mathbf{\Lambda} + \lambda\mathbf{I})^{-1}\mathbf{U}^\top \mathbf{y} = \sum_{i=1}^n \frac{\mathbf{u}_i^\top \mathbf{y}}{\lambda_i + \lambda} \mathbf{u}_i$$
Interpretation:
• $\mathbf{u}_i^\top \mathbf{y}$: projection of targets onto the $i$-th eigendirection (how much the target varies along this direction)
• $\frac{1}{\lambda_i + \lambda}$: inverse weighting (small eigenvalues → small coefficients)
• The sum reconstructs the coefficients by combining these filtered projections
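The spectral form of the dual coefficients can be checked against a direct linear solve (an illustrative sketch with synthetic data):

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(30, 2))
y = rng.normal(size=30)
K = np.exp(-np.sum((X[:, None] - X[None, :]) ** 2, axis=-1) / 2)
lam = 0.1

# Direct linear solve for the dual coefficients
alpha_direct = np.linalg.solve(K + lam * np.eye(30), y)

# Spectral form: sum_i (u_i^T y) / (lambda_i + lam) * u_i
w, U = np.linalg.eigh(K)
alpha_spectral = U @ ((U.T @ y) / (w + lam))
assert np.allclose(alpha_direct, alpha_spectral)
```

The eigendecomposition route costs more up front but amortizes well when solving for many values of $\lambda$, since only the elementwise division changes.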
Leave-One-Out Prediction:
A remarkable property of kernel ridge regression is that leave-one-out cross-validation can be computed in closed form using only the kernel matrix:
$$\text{LOO error}_i = \frac{y_i - f(\mathbf{x}_i)}{1 - H_{ii}}$$
where $\mathbf{H} = \mathbf{K}(\mathbf{K} + \lambda\mathbf{I})^{-1}$ is the "hat matrix" (smoother matrix).
This allows efficient hyperparameter selection without refitting the model $n$ times—a direct consequence of the kernel matrix structure.
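The closed-form shortcut can be verified against explicit leave-one-out refitting. A sketch on a small synthetic problem (assuming the exact LOO identity for regularized least squares):

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(20, 2))
y = rng.normal(size=20)
K = np.exp(-np.sum((X[:, None] - X[None, :]) ** 2, axis=-1) / 2)
lam, n = 0.1, 20

# Closed-form LOO residuals from the hat matrix H = K (K + lam I)^{-1}
H = K @ np.linalg.inv(K + lam * np.eye(n))
resid_shortcut = (y - H @ y) / (1 - np.diag(H))

# Explicit leave-one-out: refit n times and predict the held-out point
resid_explicit = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    alpha = np.linalg.solve(K[np.ix_(keep, keep)] + lam * np.eye(n - 1), y[keep])
    resid_explicit[i] = y[i] - K[i, keep] @ alpha

assert np.allclose(resid_shortcut, resid_explicit)
```

The shortcut needs one factorization instead of $n$, which is what makes grid searches over $\lambda$ with LOO scoring practical.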
Kernel Alignment:
The "alignment" between the kernel matrix $\mathbf{K}$ and the target similarity matrix $\mathbf{y}\mathbf{y}^\top$ predicts learning performance:
$$\text{alignment}(\mathbf{K}, \mathbf{y}\mathbf{y}^\top) = \frac{\langle \mathbf{K}, \mathbf{y}\mathbf{y}^\top \rangle_F}{|\mathbf{K}|_F |\mathbf{y}\mathbf{y}^\top|_F}$$
Higher alignment means the kernel's similarity structure matches the target's structure, indicating the kernel is appropriate for the problem.
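A minimal alignment computation, with an ad-hoc `alignment` helper (the identity matrix plays the role of a structure-blind baseline kernel here; labels are synthetic and depend only on the first input coordinate):

```python
import numpy as np

def alignment(K, y):
    """Empirical alignment <K, yy^T>_F / (||K||_F ||yy^T||_F)."""
    yy = np.outer(y, y)
    return np.sum(K * yy) / (np.linalg.norm(K) * np.linalg.norm(yy))

rng = np.random.default_rng(7)
X = rng.normal(size=(40, 2))
y = np.sign(X[:, 0])                # labels depend on the first coordinate
sq = np.sum((X[:, None] - X[None, :]) ** 2, axis=-1)
K_good = np.exp(-sq / 2)            # RBF reflects the label-relevant geometry
K_blind = np.eye(40)                # identity kernel ignores all structure
assert alignment(K_good, y) > alignment(K_blind, y)
```

In kernel selection, computing alignment for several candidate kernels before fitting anything is a cheap first filter.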
We've thoroughly examined the kernel matrix—the $n \times n$ object at the heart of kernel ridge regression: its construction, its positive semi-definiteness, its eigenstructure, and the way regularization filters its spectrum.
What's Next:
With the kernel matrix understood, we turn to Prediction with Kernels, where we examine how the fitted model produces predictions at new test points.
You now have deep understanding of the kernel matrix—its construction, properties, eigenstructure, and numerical behavior. This foundation is essential for both theoretical understanding and practical implementation of kernel ridge regression.