In the previous page, we derived the dual formulation of kernel ridge regression and saw that the entire algorithm reduces to operations involving a single matrix: the kernel matrix $\mathbf{K}$. This $n \times n$ matrix, also known as the Gram matrix, is the central computational object in kernel methods.
Understanding the kernel matrix deeply—its structure, properties, and behavior—is essential for effective use of kernel ridge regression. The kernel matrix encodes all the information about pairwise similarities between training points, and its properties directly determine the conditioning, effective capacity, and computational cost of the method.
By the end of this page, you will:
• Understand the mathematical definition and construction of the kernel matrix
• Master the crucial property of positive semi-definiteness
• Explore the eigenstructure of K and its interpretation
• Analyze how regularization modifies the spectrum
• Develop intuition for how different kernels produce different matrix structures
• Recognize numerical challenges and their solutions
Given $n$ training points $\mathbf{x}_1, \ldots, \mathbf{x}_n$ and a kernel function $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, the kernel matrix (or Gram matrix) $\mathbf{K} \in \mathbb{R}^{n \times n}$ is defined by:
$$K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)$$
Equivalently, if $\phi: \mathcal{X} \to \mathcal{H}$ is the feature map corresponding to the kernel (so that $k(\mathbf{x}, \mathbf{x}') = \langle \phi(\mathbf{x}), \phi(\mathbf{x}') \rangle_{\mathcal{H}}$), then:
$$\mathbf{K} = \mathbf{\Phi} \mathbf{\Phi}^\top$$
where $\mathbf{\Phi}$ is the $n \times D$ matrix with $\phi(\mathbf{x}_i)^\top$ as the $i$-th row.
The kernel matrix requires $\binom{n}{2} + n = \frac{n(n+1)}{2}$ kernel evaluations (exploiting symmetry $K_{ij} = K_{ji}$). This is $O(n^2)$ evaluations, each costing $O(d)$ for typical kernels on d-dimensional inputs, giving total construction cost $O(n^2 d)$.
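The half-matrix construction can be sketched as follows. This is an illustrative snippet, not a library routine; `gram_matrix` and the inlined RBF kernel are ad-hoc names chosen for the example.

```python
import numpy as np

def gram_matrix(X, kernel):
    """Build the n x n Gram matrix, evaluating the kernel only for
    i <= j and mirroring across the diagonal: n(n+1)/2 evaluations."""
    n = X.shape[0]
    K = np.empty((n, n))
    for i in range(n):
        for j in range(i, n):
            K[i, j] = kernel(X[i], X[j])
            K[j, i] = K[i, j]  # symmetry: K_ji = K_ij
    return K

# RBF kernel with sigma = 1 as a concrete example
rbf = lambda x, xp: np.exp(-np.sum((x - xp) ** 2) / 2.0)

X = np.random.default_rng(0).normal(size=(5, 2))
K = gram_matrix(X, rbf)
assert np.allclose(K, K.T) and np.allclose(np.diag(K), 1.0)
```

In practice one would vectorize the pairwise distances rather than loop, but the loop makes the $n(n+1)/2$ evaluation count explicit.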
Example: Constructing K for Common Kernels
Let's see how the kernel matrix looks for different kernel functions. Consider 4 training points in 2D: $\mathbf{x}_1 = (0, 0)$, $\mathbf{x}_2 = (1, 0)$, $\mathbf{x}_3 = (0, 1)$, $\mathbf{x}_4 = (1, 1)$.
Linear kernel $k(\mathbf{x}, \mathbf{x}') = \mathbf{x}^\top \mathbf{x}'$:
$$\mathbf{K}_{\text{linear}} = \begin{pmatrix} 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 1 \\ 0 & 0 & 1 & 1 \\ 0 & 1 & 1 & 2 \end{pmatrix}$$
Polynomial kernel $k(\mathbf{x}, \mathbf{x}') = (\mathbf{x}^\top \mathbf{x}' + 1)^2$:
$$\mathbf{K}_{\text{poly}} = \begin{pmatrix} 1 & 1 & 1 & 1 \\ 1 & 4 & 1 & 4 \\ 1 & 1 & 4 & 4 \\ 1 & 4 & 4 & 9 \end{pmatrix}$$
RBF kernel $k(\mathbf{x}, \mathbf{x}') = \exp(-|\mathbf{x} - \mathbf{x}'|^2 / (2\sigma^2))$ with $\sigma = 1$:
$$\mathbf{K}_{\text{RBF}} = \begin{pmatrix} 1 & 0.61 & 0.61 & 0.37 \\ 0.61 & 1 & 0.37 & 0.61 \\ 0.61 & 0.37 & 1 & 0.61 \\ 0.37 & 0.61 & 0.61 & 1 \end{pmatrix}$$
(values rounded)
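These three matrices can be reproduced in a few lines (a minimal sketch using vectorized NumPy operations):

```python
import numpy as np

# The four example points
X = np.array([[0, 0], [1, 0], [0, 1], [1, 1]], dtype=float)

K_lin = X @ X.T                      # linear kernel
K_poly = (X @ X.T + 1) ** 2          # inhomogeneous polynomial, degree 2
sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K_rbf = np.exp(-sq / 2.0)            # RBF with sigma = 1

assert K_lin[3, 3] == 2 and K_poly[3, 3] == 9
assert np.allclose(np.round(K_rbf, 2)[0], [1.0, 0.61, 0.61, 0.37])
```

The broadcasting trick `X[:, None, :] - X[None, :, :]` computes all pairwise differences at once, giving the $O(n^2 d)$ construction cost without Python-level loops.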
Each entry $K_{ij}$ measures similarity between points $\mathbf{x}_i$ and $\mathbf{x}_j$ according to the kernel. The diagonal $K_{ii} = k(\mathbf{x}_i, \mathbf{x}_i)$ gives each point's 'self-similarity.' For normalized kernels, diagonals are 1. The off-diagonals reveal the similarity structure of your data: which points are 'close' in feature space.
The most important property of the kernel matrix is that it is positive semi-definite (PSD). This property is not incidental—it is intimately connected to the very definition of what makes a valid kernel.
Definition (Positive Semi-Definite Matrix):
A symmetric matrix $\mathbf{M} \in \mathbb{R}^{n \times n}$ is positive semi-definite if: $$\mathbf{v}^\top \mathbf{M} \mathbf{v} \geq 0 \quad \text{for all } \mathbf{v} \in \mathbb{R}^n$$
Theorem: For any valid kernel $k$ and any set of points $\mathbf{x}_1, \ldots, \mathbf{x}_n$, the kernel matrix $\mathbf{K}$ is positive semi-definite.
Proof:
For any $\mathbf{v} = (v_1, \ldots, v_n)^\top \in \mathbb{R}^n$:
$$\mathbf{v}^\top \mathbf{K} \mathbf{v} = \sum_{i=1}^n \sum_{j=1}^n v_i K_{ij} v_j = \sum_{i=1}^n \sum_{j=1}^n v_i v_j \langle \phi(\mathbf{x}_i), \phi(\mathbf{x}_j) \rangle$$
$$= \left\langle \sum_{i=1}^n v_i \phi(\mathbf{x}_i), \sum_{j=1}^n v_j \phi(\mathbf{x}_j) \right\rangle = \left\| \sum_{i=1}^n v_i \phi(\mathbf{x}_i) \right\|^2 \geq 0$$
The quadratic form equals a squared norm, which is always non-negative. $\square$
Mercer's theorem tells us that a function $k$ is a valid kernel if and only if the kernel matrix $\mathbf{K}$ is positive semi-definite for every possible choice of points $\mathbf{x}_1, \ldots, \mathbf{x}_n$. This gives us an operational test: propose a similarity function, check that all resulting Gram matrices are PSD, and you have a valid kernel.
Why PSD Matters for Kernel Ridge Regression:
Guarantees convexity: The dual objective $\boldsymbol{\alpha}^\top \mathbf{K} \boldsymbol{\alpha}$ is a convex quadratic when $\mathbf{K}$ is PSD. This ensures a unique global minimum.
Enables regularized inversion: The matrix $(\mathbf{K} + \lambda \mathbf{I})$ is guaranteed to be positive definite (strictly) when $\lambda > 0$, because for any $\mathbf{v} \neq \mathbf{0}$: $$\mathbf{v}^\top (\mathbf{K} + \lambda \mathbf{I}) \mathbf{v} = \underbrace{\mathbf{v}^\top \mathbf{K} \mathbf{v}}_{\geq 0} + \lambda \underbrace{\|\mathbf{v}\|^2}_{> 0} > 0$$
Ensures eigenvalues are non-negative: Spectral analysis of $\mathbf{K}$ (covered below) relies on all eigenvalues being $\geq 0$.
Connects to RKHS theory: PSD kernels correspond to inner products in reproducing kernel Hilbert spaces—the theoretical foundation for kernel methods.
Not all similarity measures are valid kernels! For example, the negative of squared Euclidean distance $k(\mathbf{x}, \mathbf{x}') = -\|\mathbf{x} - \mathbf{x}'\|^2$ is NOT a valid kernel (the resulting matrix is not PSD). Using invalid kernels in kernel methods leads to ill-posed optimization problems with potentially no solution or non-unique solutions.
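Both claims can be checked numerically with Mercer's operational test: compute the Gram matrix for a sample of points and inspect its smallest eigenvalue. A sketch with an illustrative `is_psd` helper (the tolerance guards against floating-point round-off):

```python
import numpy as np

def is_psd(M, tol=1e-10):
    """Numerical PSD test: no eigenvalue below -tol (eigvalsh assumes symmetry)."""
    return np.linalg.eigvalsh(M).min() >= -tol

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
sq = np.sum((X[:, None] - X[None, :]) ** 2, axis=-1)

assert is_psd(np.exp(-sq / 2))   # RBF: a valid kernel, so the Gram matrix is PSD
assert not is_psd(-sq)           # negated squared distance: fails the test
```

Note the test is only a necessary check on one point set; validity requires PSD Gram matrices for *every* point set, so a passing check is evidence, not proof.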
The spectral decomposition of $\mathbf{K}$ reveals deep insights about the learning problem. Since $\mathbf{K}$ is symmetric and PSD, it has an eigendecomposition:
$$\mathbf{K} = \mathbf{U} \mathbf{\Lambda} \mathbf{U}^\top = \sum_{i=1}^n \lambda_i \mathbf{u}_i \mathbf{u}_i^\top$$
where $\mathbf{U} = (\mathbf{u}_1, \ldots, \mathbf{u}_n)$ is an orthogonal matrix whose columns are the eigenvectors of $\mathbf{K}$, and $\mathbf{\Lambda} = \mathrm{diag}(\lambda_1, \ldots, \lambda_n)$ holds the eigenvalues, ordered $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_n \geq 0$ (non-negative by positive semi-definiteness).
Interpretation of Eigenvalues:
The eigenvalues of $\mathbf{K}$ measure the "variance" or "importance" of different directions in the data as measured by the kernel. Large eigenvalues correspond to directions along which the data varies significantly; small eigenvalues correspond to directions with little variation.
The eigenvectors of $\mathbf{K}$ are exactly the principal components found by Kernel PCA. The eigenvalue $\lambda_i$ represents the variance captured by the $i$-th principal component in feature space. This reveals that the kernel matrix contains the complete geometric information about the data in feature space.
Rank of the Kernel Matrix:
The rank of $\mathbf{K}$ is at most $\min(n, D)$ where $D = \dim(\mathcal{H})$. This has important implications:
Linear kernel: $\mathbf{K} = \mathbf{X}\mathbf{X}^\top$ has rank $\leq \min(n, d)$ where $d$ is input dimension. For $d < n$, $\mathbf{K}$ is singular.
Polynomial kernels: Have rank bounded by the dimension of the polynomial feature space.
RBF kernel: The corresponding feature space is infinite-dimensional, but practical matrices have "effective rank" determined by the bandwidth $\sigma$ and data spread.
Eigenvalue Decay:
Different kernels produce characteristic eigenvalue decay patterns:
| Kernel | Typical Eigenvalue Decay | Implication |
|---|---|---|
| Linear | Finite rank (≤ d) | At most d non-zero eigenvalues; fast linear solve |
| Polynomial (degree p) | Finite rank (≤ dim of poly space) | Moderate decay; well-conditioned |
| RBF (large σ) | Fast exponential decay | Few significant eigenvalues; smooth functions |
| RBF (small σ) | Slow decay (spectrum near flat) | Many significant eigenvalues; localized functions |
| Matérn (smooth) | Algebraic decay ~λᵢ ∝ i⁻²ᵛ | Controlled smoothness; good numerical properties |
The rate of eigenvalue decay determines the 'effective dimensionality' of your problem. Rapidly decaying eigenvalues mean the data effectively lives in a low-dimensional subspace of feature space. Slowly decaying eigenvalues indicate that many dimensions are important—the problem is genuinely high-dimensional even under the kernel embedding.
The kernel ridge regression solution involves inverting $(\mathbf{K} + \lambda \mathbf{I})$ rather than $\mathbf{K}$ directly. Let's analyze how regularization modifies the eigenstructure.
Eigenvalues of the Regularized Matrix:
If $\mathbf{K} = \mathbf{U} \mathbf{\Lambda} \mathbf{U}^\top$, then: $$\mathbf{K} + \lambda \mathbf{I} = \mathbf{U} (\mathbf{\Lambda} + \lambda \mathbf{I}) \mathbf{U}^\top$$
The eigenvalues become $\lambda_i + \lambda$ (the kernel eigenvalue plus the regularization parameter). The eigenvectors remain unchanged.
Inversion Analysis:
$$(\mathbf{K} + \lambda \mathbf{I})^{-1} = \mathbf{U} (\mathbf{\Lambda} + \lambda \mathbf{I})^{-1} \mathbf{U}^\top = \sum_{i=1}^n \frac{1}{\lambda_i + \lambda} \mathbf{u}_i \mathbf{u}_i^\top$$
The fitted values at the training points can be written as:
$$\hat{\mathbf{f}} = \mathbf{K}(\mathbf{K} + \lambda \mathbf{I})^{-1} \mathbf{y} = \sum_{i=1}^n \frac{\lambda_i}{\lambda_i + \lambda} \, (\mathbf{u}_i^\top \mathbf{y}) \, \mathbf{u}_i$$
The factor $\lambda_i / (\lambda_i + \lambda)$ acts as a filter:
• When $\lambda_i \gg \lambda$: factor ≈ 1 (eigendirection preserved)
• When $\lambda_i \ll \lambda$: factor ≈ 0 (eigendirection suppressed)
• When $\lambda_i \approx \lambda$: factor ≈ 0.5 (transition region)
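The filter view can be verified numerically: applying the per-eigendirection factors reproduces the in-sample smoother $\mathbf{K}(\mathbf{K} + \lambda\mathbf{I})^{-1}\mathbf{y}$ exactly. A minimal sketch with an RBF kernel and synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))
y = rng.normal(size=30)
K = np.exp(-np.sum((X[:, None] - X[None, :]) ** 2, axis=-1) / 2)
lam = 0.1

# Spectral route: filter each eigencomponent by lambda_i / (lambda_i + lam)
w, U = np.linalg.eigh(K)
fitted_spectral = U @ ((w / (w + lam)) * (U.T @ y))

# Direct route: K (K + lam I)^{-1} y
fitted_direct = K @ np.linalg.solve(K + lam * np.eye(30), y)
assert np.allclose(fitted_spectral, fitted_direct)
```

The two routes agree because $\mathbf{K}$ and $(\mathbf{K} + \lambda\mathbf{I})^{-1}$ share eigenvectors, so their product filters each eigencomponent by $\lambda_i / (\lambda_i + \lambda)$.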
Interpretation: Regularization as Spectral Filtering
Regularization doesn't just "shrink" the solution—it performs spectral filtering. Directions in feature space with large eigenvalues (where data has high variance) are preserved, while directions with small eigenvalues (where data has low variance or where we're extrapolating) are suppressed.
This has profound implications:
Noise reduction: Noise often affects small-eigenvalue directions (where signal-to-noise ratio is low). Suppressing these directions reduces overfitting to noise.
Stability: Even if some $\lambda_i \approx 0$ (near-singularity), the regularized eigenvalue $\lambda_i + \lambda$ remains bounded away from zero.
Implicit dimensionality reduction: Large $\lambda$ effectively reduces the problem to the subspace spanned by eigenvectors with $\lambda_i \gg \lambda$.
Smooth transition: Unlike hard thresholding, the filter $\lambda_i / (\lambda_i + \lambda)$ provides a smooth transition, avoiding discontinuities.
The effective degrees of freedom in kernel ridge regression is:
$$\text{df}(\lambda) = \text{trace}\left(\mathbf{K}(\mathbf{K} + \lambda\mathbf{I})^{-1}\right) = \sum_{i=1}^n \frac{\lambda_i}{\lambda_i + \lambda}$$
This decreases monotonically from $\text{rank}(\mathbf{K})$, which is at most $n$, as $\lambda \to 0$, down to 0 as $\lambda \to \infty$, quantifying how much the model is constrained by regularization.
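The trace formula makes $\text{df}(\lambda)$ cheap to evaluate from a single eigendecomposition, as in this sketch (eigenvalues are clipped at zero to absorb floating-point round-off):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 2))
K = np.exp(-np.sum((X[:, None] - X[None, :]) ** 2, axis=-1) / 2)
eig = np.clip(np.linalg.eigvalsh(K), 0.0, None)  # clip round-off negatives

def dof(lam):
    # df(lambda) = trace(K (K + lam I)^{-1}) = sum_i eig_i / (eig_i + lam)
    return float(np.sum(eig / (eig + lam)))

# Degrees of freedom shrink monotonically as regularization grows
assert dof(1e-9) > dof(1.0) > dof(1e6)
```

Sweeping `lam` over a log grid and plotting `dof(lam)` is a quick way to see how much effective capacity a given regularization setting leaves.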
Different kernels produce kernel matrices with characteristic structures. Understanding these structures provides intuition for how each kernel behaves.
1. Linear Kernel: $k(\mathbf{x}, \mathbf{x}') = \mathbf{x}^\top \mathbf{x}'$
$$\mathbf{K} = \mathbf{X}\mathbf{X}^\top$$
where $\mathbf{X} \in \mathbb{R}^{n \times d}$ is the data matrix. This is the Gram matrix of the raw inputs; its $d \times d$ counterpart $\mathbf{X}^\top\mathbf{X}$ is the uncentered sample covariance. The matrix has rank at most $d$, so for high-dimensional data ($d \geq n$), it can be full rank, while for low-dimensional data ($d < n$), it is necessarily singular.
2. Polynomial Kernel: $k(\mathbf{x}, \mathbf{x}') = (\mathbf{x}^\top \mathbf{x}' + c)^p$
The polynomial kernel adds higher-order interaction terms. The constant $c$ controls the balance between lower and higher-order features. When $c = 0$ (homogeneous polynomial), only degree-$p$ terms appear. When $c > 0$ (inhomogeneous), all degrees up to $p$ are included.
3. RBF (Gaussian) Kernel: $k(\mathbf{x}, \mathbf{x}') = \exp(-|\mathbf{x} - \mathbf{x}'|^2 / 2\sigma^2)$
The RBF kernel matrix has special structure:
• Unit diagonal: $K_{ii} = 1$ for all $i$ (points are maximally similar to themselves)
• Locality: $K_{ij}$ decays exponentially with squared distance—far points contribute little
• Bandwidth sensitivity: Small σ → near-identity, sparse-like matrix; large σ → dense matrix approaching constant
• Always full rank: For distinct points, RBF kernel matrices are strictly positive definite (all eigenvalues > 0)
4. Laplacian Kernel: $k(\mathbf{x}, \mathbf{x}') = \exp(-|\mathbf{x} - \mathbf{x}'|_1 / \sigma)$
Similar to RBF but uses L1 distance. Produces less smooth functions; the corresponding RKHS contains less regular functions.
5. Custom/Composite Kernels:
Kernels can be combined: the sum $k_1 + k_2$, the product $k_1 k_2$, and the positive scaling $c \, k$ (with $c > 0$) of valid kernels are again valid kernels.
These operations preserve positive semi-definiteness, allowing construction of sophisticated, domain-specific kernels.
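Closure under these operations can be spot-checked on the Gram matrices themselves; note that the product of kernel *functions* corresponds to the elementwise (Schur) product of their Gram matrices. A small sketch:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(25, 3))
K_lin = X @ X.T
K_rbf = np.exp(-np.sum((X[:, None] - X[None, :]) ** 2, axis=-1) / 2)

min_eig = lambda M: np.linalg.eigvalsh(M).min()
tol = -1e-8  # tolerance for floating-point round-off
assert min_eig(K_lin + K_rbf) > tol    # sum of kernels is a kernel
assert min_eig(K_lin * K_rbf) > tol    # elementwise (Schur) product stays PSD
assert min_eig(3.0 * K_rbf) > tol      # positive scaling stays PSD
```

The product case is the Schur product theorem: the Hadamard product of two PSD matrices is PSD.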
| Property | Linear | Polynomial | RBF |
|---|---|---|---|
| Rank | min(n, d) | min(n, poly_dim) | n (typically) |
| Diagonal | ||xᵢ||² | (||xᵢ||² + c)ᵖ | 1 |
| Sparsity pattern | Dense | Dense | Bandwidth-dependent |
| Eigenvalue decay | Finite | Finite | Exponential |
| Condition number | Can be poor | Moderate | Bandwidth-dependent |
| Typical use | Linear models | Low-degree nonlinearity | General nonlinear |
Working with kernel matrices in practice requires attention to numerical issues. Poor conditioning can lead to inaccurate solutions or even complete failure.
Condition Number:
The condition number of $(\mathbf{K} + \lambda\mathbf{I})$ is: $$\kappa(\mathbf{K} + \lambda\mathbf{I}) = \frac{\lambda_1 + \lambda}{\lambda_n + \lambda}$$
where $\lambda_1$ and $\lambda_n$ are the largest and smallest eigenvalues of $\mathbf{K}$.
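The formula shows directly that adding $\lambda$ to both the largest and smallest eigenvalues shrinks their ratio. A short numerical sketch of this effect:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(40, 2))
K = np.exp(-np.sum((X[:, None] - X[None, :]) ** 2, axis=-1) / 2)
eig = np.clip(np.linalg.eigvalsh(K), 0.0, None)  # clip round-off negatives

def cond(lam):
    # kappa(K + lam I) = (lambda_1 + lam) / (lambda_n + lam)
    return (eig.max() + lam) / (eig.min() + lam)

# Heavier regularization always improves conditioning
assert cond(1e-2) < cond(1e-6) < cond(1e-12)
```

This is one reason a small "jitter" term is often added to kernel matrices even when regularization is not strictly needed for the statistical model.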
Practical Guidelines:
RBF kernels with a very large σ produce kernel matrices with extremely fast eigenvalue decay: the matrix approaches a constant, nearly rank-one matrix and becomes numerically singular even when theoretically of full rank. To diagnose: if K is 'nearly identity' (off-diagonals ≈ 0), your σ is too small. If K is 'nearly constant' (all entries ≈ 1), your σ is too large relative to data spread.
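The two failure modes of the bandwidth are easy to see numerically. This sketch uses the median off-diagonal entry as an ad-hoc diagnostic (not a standard statistic, just a convenient summary):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(30, 2))
sq = np.sum((X[:, None] - X[None, :]) ** 2, axis=-1)

def rbf_gram(sigma):
    return np.exp(-sq / (2 * sigma ** 2))

# Median off-diagonal entry as a quick bandwidth sanity check
med_off = lambda K: np.median(K[~np.eye(30, dtype=bool)])
assert med_off(rbf_gram(0.01)) < 1e-6    # near-identity: sigma too small
assert med_off(rbf_gram(100.0)) > 0.999  # near-constant: sigma too large
```

A common starting heuristic is to set σ near the median pairwise distance of the data, which keeps the typical off-diagonal entry well away from both extremes.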
Storage Considerations:
The kernel matrix requires $O(n^2)$ storage, which becomes prohibitive for large datasets:
| n | Storage (8-byte doubles) | Typical RAM |
|---|---|---|
| 1,000 | 8 MB | Trivial |
| 10,000 | 800 MB | Manageable |
| 100,000 | 80 GB | Challenging |
| 1,000,000 | 8 TB | Impractical |
For large-scale problems, approximation methods (covered in Module 5) become essential. The full kernel matrix simply cannot be stored or manipulated.
Let's connect the kernel matrix's mathematical properties to its role in learning.
Information Content:
The kernel matrix $\mathbf{K}$ captures all the information from the training data that is relevant for kernel ridge regression. Once $\mathbf{K}$ is computed, the original data $\mathbf{x}_i$ are not needed for solving the optimization—only for computing kernel evaluations at test time.
One practical consequence: $\mathbf{K}$ can be computed once and reused while tuning $\lambda$, since only the $(\mathbf{K} + \lambda\mathbf{I})$ solve changes between settings.
Relationship to Training:
The dual coefficients can be written as:
$$\boldsymbol{\alpha}^* = \mathbf{U}(\mathbf{\Lambda} + \lambda\mathbf{I})^{-1}\mathbf{U}^\top \mathbf{y} = \sum_{i=1}^n \frac{\mathbf{u}_i^\top \mathbf{y}}{\lambda_i + \lambda} \mathbf{u}_i$$
Interpretation:
• $\mathbf{u}_i^\top \mathbf{y}$: projection of targets onto the $i$-th eigendirection (how much the target varies along this direction)
• $\frac{1}{\lambda_i + \lambda}$: inverse weighting (small eigenvalues → small coefficients)
• The sum reconstructs the coefficients by combining these filtered projections
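The spectral form of the dual coefficients can be checked against a direct linear solve (an illustrative sketch with synthetic data):

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(30, 2))
y = rng.normal(size=30)
K = np.exp(-np.sum((X[:, None] - X[None, :]) ** 2, axis=-1) / 2)
lam = 0.1

# Direct linear solve for the dual coefficients
alpha_direct = np.linalg.solve(K + lam * np.eye(30), y)

# Spectral form: sum_i (u_i^T y) / (lambda_i + lam) * u_i
w, U = np.linalg.eigh(K)
alpha_spectral = U @ ((U.T @ y) / (w + lam))
assert np.allclose(alpha_direct, alpha_spectral)
```

The eigendecomposition route costs more up front but amortizes well when solving for many values of $\lambda$, since only the elementwise division changes.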
Leave-One-Out Prediction:
A remarkable property of kernel ridge regression is that leave-one-out cross-validation can be computed in closed form using only the kernel matrix:
$$\text{LOO error}_i = \frac{y_i - f(\mathbf{x}_i)}{1 - H_{ii}}$$
where $\mathbf{H} = \mathbf{K}(\mathbf{K} + \lambda\mathbf{I})^{-1}$ is the "hat matrix" (smoother matrix).
This allows efficient hyperparameter selection without refitting the model $n$ times—a direct consequence of the kernel matrix structure.
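The closed-form shortcut can be verified against explicit leave-one-out refitting. A sketch on a small synthetic problem (assuming the exact LOO identity for regularized least squares):

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(20, 2))
y = rng.normal(size=20)
K = np.exp(-np.sum((X[:, None] - X[None, :]) ** 2, axis=-1) / 2)
lam, n = 0.1, 20

# Closed-form LOO residuals from the hat matrix H = K (K + lam I)^{-1}
H = K @ np.linalg.inv(K + lam * np.eye(n))
resid_shortcut = (y - H @ y) / (1 - np.diag(H))

# Explicit leave-one-out: refit n times and predict the held-out point
resid_explicit = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    alpha = np.linalg.solve(K[np.ix_(keep, keep)] + lam * np.eye(n - 1), y[keep])
    resid_explicit[i] = y[i] - K[i, keep] @ alpha

assert np.allclose(resid_shortcut, resid_explicit)
```

The shortcut needs one factorization instead of $n$, which is what makes grid searches over $\lambda$ with LOO scoring practical.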
Kernel Alignment:
The "alignment" between the kernel matrix $\mathbf{K}$ and the target similarity matrix $\mathbf{y}\mathbf{y}^\top$ predicts learning performance:
$$\text{alignment}(\mathbf{K}, \mathbf{y}\mathbf{y}^\top) = \frac{\langle \mathbf{K}, \mathbf{y}\mathbf{y}^\top \rangle_F}{|\mathbf{K}|_F |\mathbf{y}\mathbf{y}^\top|_F}$$
Higher alignment means the kernel's similarity structure matches the target's structure, indicating the kernel is appropriate for the problem.
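A minimal alignment computation, with an ad-hoc `alignment` helper (the identity matrix plays the role of a structure-blind baseline kernel here; labels are synthetic and depend only on the first input coordinate):

```python
import numpy as np

def alignment(K, y):
    """Empirical alignment <K, yy^T>_F / (||K||_F ||yy^T||_F)."""
    yy = np.outer(y, y)
    return np.sum(K * yy) / (np.linalg.norm(K) * np.linalg.norm(yy))

rng = np.random.default_rng(7)
X = rng.normal(size=(40, 2))
y = np.sign(X[:, 0])                # labels depend on the first coordinate
sq = np.sum((X[:, None] - X[None, :]) ** 2, axis=-1)
K_good = np.exp(-sq / 2)            # RBF reflects the label-relevant geometry
K_blind = np.eye(40)                # identity kernel ignores all structure
assert alignment(K_good, y) > alignment(K_blind, y)
```

In kernel selection, computing alignment for several candidate kernels before fitting anything is a cheap first filter.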
We've thoroughly examined the kernel matrix—the $n \times n$ object at the heart of kernel ridge regression: its construction, its positive semi-definiteness, its eigenstructure, and the way regularization filters its spectrum.
What's Next:
With the kernel matrix understood, we turn to Prediction with Kernels, where we examine how the fitted model produces predictions at new test points.
You now have deep understanding of the kernel matrix—its construction, properties, eigenstructure, and numerical behavior. This foundation is essential for both theoretical understanding and practical implementation of kernel ridge regression.