In the previous module, we explored the kernel trick—the remarkable mathematical insight that allows us to compute inner products in high-dimensional (possibly infinite-dimensional) feature spaces without explicitly constructing those spaces. We saw that this trick hinges on expressing algorithms solely in terms of inner products between data points.
But here's the critical question: How do we reformulate ridge regression so that it depends only on inner products?
This is where the dual formulation becomes essential. The primal form of ridge regression expresses the solution in terms of the weight vector $\mathbf{w}$, which lives in feature space. The dual form expresses the same solution in terms of coefficients that weight the training examples themselves. This shift in perspective—from feature-centric to example-centric—is what unlocks the full power of kernel methods.
By the end of this page, you will: • Derive the dual formulation of ridge regression from first principles • Understand the Representer Theorem and its profound implications • See exactly how the kernel trick naturally emerges from the dual • Grasp the geometric intuition behind the primal-dual relationship • Recognize when and why working in the dual is computationally advantageous
Let's begin by establishing our notation precisely and recalling the primal formulation of ridge regression.
Setup:
The primal ridge regression objective seeks weights $\mathbf{w} \in \mathcal{H}$ that minimize:
$$\mathcal{L}_{\text{primal}}(\mathbf{w}) = \|\mathbf{\Phi}\mathbf{w} - \mathbf{y}\|^2 + \lambda \|\mathbf{w}\|^2$$
where $\lambda > 0$ is the regularization parameter, $\mathbf{y} = (y_1, \ldots, y_n)^\top$ is the target vector, and $\mathbf{\Phi}$ is the $n \times D$ feature matrix whose $i$-th row is $\phi(\mathbf{x}_i)^\top$. The first term measures training error while the second penalizes model complexity.
Setting the gradient to zero, we derived the primal solution:
$$\mathbf{w}^* = (\mathbf{\Phi}^\top \mathbf{\Phi} + \lambda \mathbf{I}_D)^{-1} \mathbf{\Phi}^\top \mathbf{y}$$
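As a concrete illustration, here is a minimal NumPy sketch of this closed form; the feature matrix `Phi`, the data, and all parameter values are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n, D = 50, 10                      # n training points, D explicit features
Phi = rng.normal(size=(n, D))      # feature matrix with rows phi(x_i)^T
y = rng.normal(size=n)             # target vector
lam = 0.1                          # regularization parameter lambda

# Primal closed form: w* = (Phi^T Phi + lambda I_D)^(-1) Phi^T y
w_primal = np.linalg.solve(Phi.T @ Phi + lam * np.eye(D), Phi.T @ y)
print(w_primal.shape)              # (D,) -- the solution lives in feature space
```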
Its key characteristic is that it requires forming and inverting the $D \times D$ matrix $\mathbf{\Phi}^\top \mathbf{\Phi} + \lambda \mathbf{I}_D$. When $D$ is large (or infinite, as with RBF kernels), this primal form becomes computationally intractable or mathematically undefined. We need a different approach.
Consider a polynomial kernel of degree 5 on 1000-dimensional input. The explicit feature space has $\binom{1000+5}{5} \approx 8.4 \times 10^{12}$ dimensions. No computer can store or invert an $8.4$ trillion × $8.4$ trillion matrix. Yet kernel ridge regression handles this effortlessly—the dual formulation is why.
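The dimension count is easy to verify directly; a one-line sketch using Python's standard library:

```python
from math import comb

# Explicit feature-space dimension of a degree-5 polynomial kernel on 1000-D input
print(f"{comb(1000 + 5, 5):,}")   # over 8.4 trillion distinct monomials
```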
Before deriving the dual, we establish a foundational result that explains why the dual form is even possible. This result, known as the Representer Theorem, is one of the most important theorems in kernel methods and regularization theory.
Intuition: When we solve a regularized learning problem, we're balancing fit to the data against model complexity. The theorem tells us that no matter how rich our feature space is, the optimal solution always lies in a finite-dimensional subspace spanned by the training examples themselves.
For a broad class of regularized learning problems (including ridge regression), the optimal solution $\mathbf{w}^*$ can be written as a linear combination of the feature-mapped training points:
$$\mathbf{w}^* = \sum_{i=1}^{n} \alpha_i \phi(\mathbf{x}_i) = \mathbf{\Phi}^\top \boldsymbol{\alpha}$$
for some coefficients $\boldsymbol{\alpha} = (\alpha_1, \ldots, \alpha_n)^\top \in \mathbb{R}^n$.
Formal Statement:
Consider the optimization problem: $$\min_{\mathbf{w} \in \mathcal{H}} \left[ L(\mathbf{w}^\top \phi(\mathbf{x}_1), \ldots, \mathbf{w}^\top \phi(\mathbf{x}_n), y_1, \ldots, y_n) + \Omega(\|\mathbf{w}\|) \right]$$
where $L$ is an arbitrary loss function that depends on $\mathbf{w}$ only through the predicted values $\mathbf{w}^\top \phi(\mathbf{x}_i)$, and $\Omega$ is a strictly increasing function of the norm $\|\mathbf{w}\|$.
Then there exist $\alpha_1, \ldots, \alpha_n \in \mathbb{R}$ such that $\mathbf{w}^* = \sum_{i=1}^{n} \alpha_i \phi(\mathbf{x}_i)$.
Proof Sketch:
Decompose any $\mathbf{w}$ as: $$\mathbf{w} = \underbrace{\sum_{i=1}^{n} \alpha_i \phi(\mathbf{x}_i)}_{\mathbf{w}_\parallel \,\in\, \text{span}\{\phi(\mathbf{x}_i)\}} + \underbrace{\mathbf{w}_\perp}_{\perp \text{ to span}}$$
Critically, predictions depend only on $\mathbf{w}_\parallel$: $$\mathbf{w}^\top \phi(\mathbf{x}_j) = \mathbf{w}_\parallel^\top \phi(\mathbf{x}_j) + \mathbf{w}_\perp^\top \phi(\mathbf{x}_j) = \mathbf{w}_\parallel^\top \phi(\mathbf{x}_j)$$
since $\mathbf{w}_\perp \perp \phi(\mathbf{x}_j)$.
But $\|\mathbf{w}\|^2 = \|\mathbf{w}_\parallel\|^2 + \|\mathbf{w}_\perp\|^2 \geq \|\mathbf{w}_\parallel\|^2$.
Thus, $\mathbf{w}_\perp$ adds to the regularization penalty without affecting predictions. Since $\Omega$ is strictly increasing, the optimal solution must have $\mathbf{w}_\perp = \mathbf{0}$. $\square$
The Representer Theorem is profound: even if your feature space has billions (or infinitely many) dimensions, the optimal weights live in an n-dimensional subspace determined entirely by your n training examples. This is why we can switch from optimizing over w ∈ ℝ^D to optimizing over α ∈ ℝ^n—a dramatic reduction when n << D.
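A quick numerical sanity check of this claim, with made-up data and deliberately $n \ll D$: the primal solution should lie (to machine precision) in the span of the feature-mapped training points, i.e. the rows of `Phi`.

```python
import numpy as np

rng = np.random.default_rng(1)
n, D = 20, 200                          # deliberately n << D
Phi = rng.normal(size=(n, D))
y = rng.normal(size=n)
lam = 0.5

# Primal solution in the D-dimensional feature space
w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(D), Phi.T @ y)

# Representer Theorem check: w should be expressible as Phi^T alpha.
# Find the best alpha by least squares and measure the leftover component.
alpha, *_ = np.linalg.lstsq(Phi.T, w, rcond=None)
residual = np.linalg.norm(Phi.T @ alpha - w)
print(residual)   # effectively zero: w lies in span{phi(x_1), ..., phi(x_n)}
```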
Now we perform the key derivation: substituting the representer form $\mathbf{w} = \mathbf{\Phi}^\top \boldsymbol{\alpha}$ into the primal objective.
Step 1: Substitute into the prediction term
The predicted values on training data become: $$\mathbf{\Phi} \mathbf{w} = \mathbf{\Phi} \mathbf{\Phi}^\top \boldsymbol{\alpha} = \mathbf{K} \boldsymbol{\alpha}$$
where $\mathbf{K} = \mathbf{\Phi} \mathbf{\Phi}^\top$ is the kernel matrix (Gram matrix) with entries: $$K_{ij} = \phi(\mathbf{x}_i)^\top \phi(\mathbf{x}_j) = k(\mathbf{x}_i, \mathbf{x}_j)$$
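To see the identity $K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)$ in action, here is a small sketch using the homogeneous quadratic kernel $k(\mathbf{x}, \mathbf{z}) = (\mathbf{x}^\top\mathbf{z})^2$, chosen only because its explicit feature map is short; it is not one of the kernels discussed above.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(5, 2))             # 5 points in 2-D input space

# Explicit feature map for k(x, z) = (x.z)^2: phi(x) = (x1^2, sqrt(2) x1 x2, x2^2)
Phi = np.column_stack([X[:, 0]**2, np.sqrt(2) * X[:, 0] * X[:, 1], X[:, 1]**2])

K_explicit = Phi @ Phi.T                # Gram matrix via explicit features
K_kernel = (X @ X.T) ** 2               # same matrix via kernel evaluations only

print(np.allclose(K_explicit, K_kernel))  # True
```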
Step 2: Substitute into the regularization term
$$\|\mathbf{w}\|^2 = \mathbf{w}^\top \mathbf{w} = (\mathbf{\Phi}^\top \boldsymbol{\alpha})^\top (\mathbf{\Phi}^\top \boldsymbol{\alpha}) = \boldsymbol{\alpha}^\top \mathbf{\Phi} \mathbf{\Phi}^\top \boldsymbol{\alpha} = \boldsymbol{\alpha}^\top \mathbf{K} \boldsymbol{\alpha}$$
Step 3: Write the dual objective
$$\mathcal{L}_{\text{dual}}(\boldsymbol{\alpha}) = \|\mathbf{K}\boldsymbol{\alpha} - \mathbf{y}\|^2 + \lambda \boldsymbol{\alpha}^\top \mathbf{K} \boldsymbol{\alpha}$$
Expanding: $$= \boldsymbol{\alpha}^\top \mathbf{K}^2 \boldsymbol{\alpha} - 2\mathbf{y}^\top \mathbf{K} \boldsymbol{\alpha} + \mathbf{y}^\top \mathbf{y} + \lambda \boldsymbol{\alpha}^\top \mathbf{K} \boldsymbol{\alpha}$$
$$= \boldsymbol{\alpha}^\top (\mathbf{K}^2 + \lambda \mathbf{K}) \boldsymbol{\alpha} - 2\mathbf{y}^\top \mathbf{K} \boldsymbol{\alpha} + \text{const}$$
Step 4: Solve by setting gradient to zero
$$\nabla_\alpha \mathcal{L}_{\text{dual}} = 2(\mathbf{K}^2 + \lambda \mathbf{K})\boldsymbol{\alpha} - 2\mathbf{K}\mathbf{y} = \mathbf{0}$$
Factoring out $\mathbf{K}$ gives: $$\mathbf{K}\left[(\mathbf{K} + \lambda \mathbf{I})\boldsymbol{\alpha} - \mathbf{y}\right] = \mathbf{0}$$
Since $\mathbf{K}$ is positive semi-definite and $\lambda > 0$, the matrix $\mathbf{K} + \lambda \mathbf{I}$ is invertible (its eigenvalues are at least $\lambda$). Any $\boldsymbol{\alpha}$ satisfying $(\mathbf{K} + \lambda \mathbf{I})\boldsymbol{\alpha} = \mathbf{y}$ therefore solves the stationarity condition, and when $\mathbf{K}$ has full rank it is the unique minimizer:
$$\boldsymbol{\alpha}^* = (\mathbf{K} + \lambda \mathbf{I}_n)^{-1} \mathbf{y}$$
This is the closed-form solution for kernel ridge regression in dual form. Notice: • The matrix is n × n, not D × D • It depends only on kernel evaluations k(xᵢ, xⱼ) • No explicit features φ(x) appear anywhere
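A minimal sketch of the fitting step, assuming the kernel matrix `K` has already been computed; the linear kernel and all names below are purely illustrative.

```python
import numpy as np

def fit_kernel_ridge(K, y, lam):
    """Solve (K + lam I) alpha = y for the dual coefficients."""
    n = K.shape[0]
    return np.linalg.solve(K + lam * np.eye(n), y)

# Illustrative example with a linear kernel k(x, z) = x.z
rng = np.random.default_rng(3)
X, y = rng.normal(size=(30, 4)), rng.normal(size=30)
K = X @ X.T                      # n x n kernel matrix
alpha = fit_kernel_ridge(K, y, lam=0.1)
print(alpha.shape)               # (30,) -- one coefficient per training example
```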
The key insight: We've completely eliminated the feature mapping $\phi$ from both the optimization and the solution. All that matters is the kernel matrix $\mathbf{K}$, which can be computed directly using the kernel function $k(\mathbf{x}_i, \mathbf{x}_j)$.
This is the kernel trick in action for ridge regression.
A natural question arises: Are the primal and dual solutions truly equivalent? They're derived differently—do they produce the same predictions?
The answer is a resounding yes. Let's prove this rigorously.
From Dual to Primal:
Given $\boldsymbol{\alpha}^* = (\mathbf{K} + \lambda \mathbf{I})^{-1} \mathbf{y}$, the corresponding weight vector is: $$\mathbf{w}^* = \mathbf{\Phi}^\top \boldsymbol{\alpha}^* = \mathbf{\Phi}^\top (\mathbf{K} + \lambda \mathbf{I})^{-1} \mathbf{y}$$
From Primal to Dual:
The primal solution is: $$\mathbf{w}^* = (\mathbf{\Phi}^\top \mathbf{\Phi} + \lambda \mathbf{I}_D)^{-1} \mathbf{\Phi}^\top \mathbf{y}$$
Claim: These expressions yield identical predictions.
For matrices A, B, C of compatible dimensions with A and C invertible:
$$(\mathbf{A} + \mathbf{B}\mathbf{C}^{-1}\mathbf{B}^\top)^{-1}\mathbf{B}\mathbf{C}^{-1} = \mathbf{A}^{-1}\mathbf{B}(\mathbf{C} + \mathbf{B}^\top\mathbf{A}^{-1}\mathbf{B})^{-1}$$
This identity (a consequence of the matrix inversion lemma) lets us convert between primal and dual forms.
Proof of Equivalence:
We need to show that for any test point $\mathbf{x}_*$: $$\phi(\mathbf{x}_*)^\top \mathbf{w}^*_{\text{primal}} = \phi(\mathbf{x}_*)^\top \mathbf{w}^*_{\text{dual}}$$
Using the Woodbury identity with $\mathbf{A} = \lambda \mathbf{I}_D$ and $\mathbf{B} = \mathbf{\Phi}^\top$, $\mathbf{C} = \mathbf{I}_n$:
$$(\lambda \mathbf{I} + \mathbf{\Phi}^\top \mathbf{\Phi})^{-1} \mathbf{\Phi}^\top = \frac{1}{\lambda}\mathbf{\Phi}^\top (\mathbf{I} + \frac{1}{\lambda}\mathbf{\Phi}\mathbf{\Phi}^\top)^{-1}$$ $$= \mathbf{\Phi}^\top (\lambda \mathbf{I} + \mathbf{\Phi}\mathbf{\Phi}^\top)^{-1} = \mathbf{\Phi}^\top (\lambda \mathbf{I} + \mathbf{K})^{-1}$$
Therefore: $$\mathbf{w}^*_{\text{primal}} = (\mathbf{\Phi}^\top \mathbf{\Phi} + \lambda \mathbf{I})^{-1} \mathbf{\Phi}^\top \mathbf{y} = \mathbf{\Phi}^\top (\mathbf{K} + \lambda \mathbf{I})^{-1} \mathbf{y} = \mathbf{w}^*_{\text{dual}}$$
The solutions are identical. $\square$
While mathematically equivalent, the primal and dual may differ numerically due to finite precision arithmetic. The dual is numerically preferable when n << D (fewer computations, smaller matrices), while the primal is preferable when D << n. The crossover point depends on implementation details and problem conditioning.
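The equivalence, and the numerical point above, can be checked directly on a small random problem with an explicit finite-dimensional feature matrix standing in for $\mathbf{\Phi}$:

```python
import numpy as np

rng = np.random.default_rng(5)
n, D = 25, 40
Phi = rng.normal(size=(n, D))           # training features phi(x_i)^T
y = rng.normal(size=n)
phi_test = rng.normal(size=D)           # phi(x_*) for one test point
lam = 0.2

# Primal: w* = (Phi^T Phi + lam I_D)^(-1) Phi^T y
w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(D), Phi.T @ y)

# Dual: alpha* = (K + lam I_n)^(-1) y, with K = Phi Phi^T
alpha = np.linalg.solve(Phi @ Phi.T + lam * np.eye(n), y)

pred_primal = phi_test @ w
pred_dual = (Phi @ phi_test) @ alpha    # k_* entries are phi(x_i)^T phi(x_*)
print(np.isclose(pred_primal, pred_dual))   # True (up to floating point)
```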
The duality between primal and dual formulations has a beautiful geometric interpretation that deepens our understanding of what's happening.
The Primal View:
In the primal, we're finding weights $\mathbf{w} \in \mathcal{H}$ that define a linear function $f(\mathbf{x}) = \mathbf{w}^\top \phi(\mathbf{x})$. The space of all such functions is the full feature space $\mathcal{H}$, which may have dimension $D \gg n$ or even be infinite-dimensional.
The regularization $\|\mathbf{w}\|^2$ penalizes directions in $\mathcal{H}$ that are "expensive" according to the norm.
The Dual View:
In the dual, we're expressing the same function as: $$f(\mathbf{x}) = \sum_{i=1}^n \alpha_i k(\mathbf{x}_i, \mathbf{x}) = \sum_{i=1}^n \alpha_i \phi(\mathbf{x}_i)^\top \phi(\mathbf{x})$$
This represents $f$ as a weighted combination of "basis functions" centered at each training point. The coefficient $\alpha_i$ determines how much training example $i$ contributes to the prediction.
The Subspace Relationship:
The Representer Theorem tells us that $\mathbf{w}^* \in \text{span}\{\phi(\mathbf{x}_1), \ldots, \phi(\mathbf{x}_n)\}$. This subspace has dimension at most $n$, regardless of $D$.
Geometrically, imagine the feature space $\mathcal{H}$ as a vast arena. The training examples $\phi(\mathbf{x}_1), \ldots, \phi(\mathbf{x}_n)$ define a small "tent" within this arena—an $n$-dimensional subspace. The Representer Theorem guarantees that the optimal $\mathbf{w}^*$ lies within this tent.
The dual optimization directly parameterizes vectors in this tent via the coordinates $\boldsymbol{\alpha}$. This is why it's computationally efficient: we're searching an $n$-dimensional space instead of a $D$-dimensional one.
Primal and dual are just different coordinate systems for the same solution. The primal uses the standard basis of the feature space; the dual uses the training examples as basis vectors. When the training examples span a much smaller space than the full feature space, the dual coordinates are more efficient.
We're now in a position to see precisely why the kernel trick is so powerful for ridge regression.
Observation 1: The dual solution depends only on K
$$\boldsymbol{\alpha}^* = (\mathbf{K} + \lambda \mathbf{I})^{-1} \mathbf{y}$$
To compute $\boldsymbol{\alpha}^*$, we never need explicit features $\phi(\mathbf{x}_i)$. We only need the $n \times n$ kernel matrix $\mathbf{K}$, with entries $K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)$, and the target vector $\mathbf{y}$.
Observation 2: Predictions depend only on kernel evaluations
For a test point $\mathbf{x}_*$: $$f(\mathbf{x}_*) = \mathbf{w}^{*\top} \phi(\mathbf{x}_*) = \sum_{i=1}^n \alpha_i^* \phi(\mathbf{x}_i)^\top \phi(\mathbf{x}_*) = \sum_{i=1}^n \alpha_i^* k(\mathbf{x}_i, \mathbf{x}_*)$$
$$f(\mathbf{x}_*) = \mathbf{k}_*^\top \boldsymbol{\alpha}^* = \mathbf{k}_*^\top (\mathbf{K} + \lambda \mathbf{I})^{-1} \mathbf{y}$$
where $\mathbf{k}_* = (k(\mathbf{x}_1, \mathbf{x}_*), \ldots, k(\mathbf{x}_n, \mathbf{x}_*))^\top$ is the vector of kernel evaluations between the test point and all training points.
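A sketch of the prediction step; `krr_predict` and the surrounding names are illustrative, not from any particular library.

```python
import numpy as np

def krr_predict(kernel, X_train, alpha, X_test):
    """Predict f(x_*) = k_*^T alpha for each test point, using only kernel calls."""
    K_test = np.array([[kernel(x_i, x_star) for x_i in X_train] for x_star in X_test])
    return K_test @ alpha

# Illustrative usage with a linear kernel (any valid kernel works the same way)
linear_kernel = lambda x, z: float(x @ z)
rng = np.random.default_rng(6)
X_train, y = rng.normal(size=(20, 3)), rng.normal(size=20)
K = np.array([[linear_kernel(a, b) for b in X_train] for a in X_train])
alpha = np.linalg.solve(K + 0.1 * np.eye(20), y)
print(krr_predict(linear_kernel, X_train, alpha, rng.normal(size=(4, 3))))
```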
Observation 3: The entire algorithm is "kernelized"
Every step of kernel ridge regression can be performed with only kernel evaluations: forming the kernel matrix $\mathbf{K}$, solving for the coefficients $\boldsymbol{\alpha}^*$, and predicting on new points via $\mathbf{k}_*$.
Observation 4: This works for infinite-dimensional feature spaces
For the RBF kernel: $$k(\mathbf{x}, \mathbf{x}') = \exp\left(-\frac{\|\mathbf{x} - \mathbf{x}'\|^2}{2\sigma^2}\right)$$
The corresponding feature space $\mathcal{H}$ is infinite-dimensional. Yet we can fit the model and compute predictions exactly, using only the $n \times n$ kernel matrix and kernel evaluations at test points.
We're effectively working in infinite dimensions without ever constructing an infinite-dimensional vector!
The dual formulation transforms a potentially impossible optimization in D-dimensional feature space into a tractable n × n linear system. When D = ∞ (as with RBF kernels), this isn't just convenient—it's the only way to proceed. The kernel trick makes the impossible possible.
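Putting the pieces together, here is a minimal end-to-end sketch of kernel ridge regression with an RBF kernel on synthetic 1-D data; all names and hyperparameter values are illustrative.

```python
import numpy as np

def rbf_kernel_matrix(A, B, sigma=1.0):
    """Pairwise RBF kernel evaluations k(a, b) = exp(-||a - b||^2 / (2 sigma^2))."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dists / (2 * sigma ** 2))

# Fit a 1-D nonlinear function with kernel ridge regression (synthetic data)
rng = np.random.default_rng(7)
X = rng.uniform(-3, 3, size=(40, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=40)
lam, sigma = 0.1, 0.7

K = rbf_kernel_matrix(X, X, sigma)                      # n x n, despite D = infinity
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)    # dual coefficients

X_test = np.linspace(-3, 3, 5).reshape(-1, 1)
f_test = rbf_kernel_matrix(X_test, X, sigma) @ alpha    # k_*^T alpha per test point
print(np.round(f_test, 2))                              # approximates sin on [-3, 3]
```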
For completeness, let's derive the dual using Lagrangian duality—the classical optimization approach. This provides additional insight and connects to the broader theory of convex optimization.
Reformulate as Constrained Optimization:
Rewrite ridge regression with an auxiliary variable: $$\min_{\mathbf{w}, \boldsymbol{\xi}} \|\boldsymbol{\xi}\|^2 + \lambda \|\mathbf{w}\|^2 \quad \text{s.t.} \quad \boldsymbol{\xi} = \mathbf{\Phi}\mathbf{w} - \mathbf{y}$$
Form the Lagrangian: $$\mathcal{L}(\mathbf{w}, \boldsymbol{\xi}, \boldsymbol{\beta}) = \|\boldsymbol{\xi}\|^2 + \lambda \|\mathbf{w}\|^2 + \boldsymbol{\beta}^\top(\mathbf{\Phi}\mathbf{w} - \mathbf{y} - \boldsymbol{\xi})$$
where $\boldsymbol{\beta} \in \mathbb{R}^n$ are Lagrange multipliers.
Minimize over primal variables:
$$\frac{\partial \mathcal{L}}{\partial \boldsymbol{\xi}} = 2\boldsymbol{\xi} - \boldsymbol{\beta} = \mathbf{0} \implies \boldsymbol{\xi}^* = \frac{1}{2}\boldsymbol{\beta}$$
$$\frac{\partial \mathcal{L}}{\partial \mathbf{w}} = 2\lambda \mathbf{w} + \mathbf{\Phi}^\top \boldsymbol{\beta} = \mathbf{0} \implies \mathbf{w}^* = -\frac{1}{2\lambda}\mathbf{\Phi}^\top \boldsymbol{\beta}$$
Substitute back to get dual:
$$g(\boldsymbol{\beta}) = \frac{1}{4}\|\boldsymbol{\beta}\|^2 + \frac{1}{4\lambda}\boldsymbol{\beta}^\top \mathbf{K} \boldsymbol{\beta} - \frac{1}{2\lambda}\boldsymbol{\beta}^\top \mathbf{K} \boldsymbol{\beta} - \boldsymbol{\beta}^\top\mathbf{y} - \frac{1}{2}\|\boldsymbol{\beta}\|^2$$
Simplifying: $$g(\boldsymbol{\beta}) = -\frac{1}{4}\|\boldsymbol{\beta}\|^2 - \frac{1}{4\lambda}\boldsymbol{\beta}^\top \mathbf{K} \boldsymbol{\beta} - \boldsymbol{\beta}^\top\mathbf{y}$$
Maximizing this (or equivalently, setting its gradient to zero): $$-\frac{1}{2}\boldsymbol{\beta} - \frac{1}{2\lambda}\mathbf{K}\boldsymbol{\beta} - \mathbf{y} = \mathbf{0}$$ $$\left(\mathbf{I} + \frac{1}{\lambda}\mathbf{K}\right)\boldsymbol{\beta} = -2\mathbf{y}$$ $$(\mathbf{K} + \lambda\mathbf{I})\boldsymbol{\beta} = -2\lambda\mathbf{y}$$
With $\boldsymbol{\alpha} = -\frac{1}{2\lambda}\boldsymbol{\beta}$, we recover: $$\boldsymbol{\alpha} = (\mathbf{K} + \lambda\mathbf{I})^{-1}\mathbf{y}$$
The same solution, derived through a different lens!
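As a final sanity check on the Lagrangian route (with invented data), the multipliers and the dual coefficients should be related by $\boldsymbol{\beta} = -2\lambda\boldsymbol{\alpha}$, so $\boldsymbol{\beta}$ satisfies $(\mathbf{K} + \lambda\mathbf{I})\boldsymbol{\beta} = -2\lambda\mathbf{y}$:

```python
import numpy as np

rng = np.random.default_rng(8)
n = 15
A = rng.normal(size=(n, 5))
K = A @ A.T                                        # a valid (PSD) kernel matrix
y = rng.normal(size=n)
lam = 0.4

alpha = np.linalg.solve(K + lam * np.eye(n), y)    # dual coefficients
beta = -2 * lam * alpha                            # implied Lagrange multipliers
print(np.allclose((K + lam * np.eye(n)) @ beta, -2 * lam * y))   # True
```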
Since ridge regression is a convex problem whose only constraint (in the reformulation above) is affine, strong duality holds: the optimal values of the primal and dual are equal. This guarantees that the dual solution truly optimizes the original problem, not just a lower bound.
We've established the mathematical foundation for kernel ridge regression by deriving and analyzing the dual formulation. To consolidate: the Representer Theorem guarantees the solution lies in the span of the feature-mapped training points; substituting $\mathbf{w} = \mathbf{\Phi}^\top\boldsymbol{\alpha}$ yields the $n \times n$ dual system $(\mathbf{K} + \lambda\mathbf{I})\boldsymbol{\alpha} = \mathbf{y}$; the primal and dual solutions make identical predictions; and every step requires only kernel evaluations.
What's Next:
In the next page, we'll examine the kernel matrix in depth. This $n \times n$ matrix $\mathbf{K}$ is the central object in kernel ridge regression, and we'll explore its properties in detail.
You now understand the dual formulation of ridge regression—the key insight that enables kernel methods. This mathematical machinery transforms intractable high-dimensional optimization into tractable n × n linear algebra, opening the door to powerful nonlinear regression in arbitrary feature spaces.