In the previous module, we explored the kernel trick—the remarkable mathematical insight that allows us to compute inner products in high-dimensional (possibly infinite-dimensional) feature spaces without explicitly constructing those spaces. We saw that this trick hinges on expressing algorithms solely in terms of inner products between data points.
But here's the critical question: How do we reformulate ridge regression so that it depends only on inner products?
This is where the dual formulation becomes essential. The primal form of ridge regression expresses the solution in terms of the weight vector $\mathbf{w}$, which lives in feature space. The dual form expresses the same solution in terms of coefficients that weight the training examples themselves. This shift in perspective—from feature-centric to example-centric—is what unlocks the full power of kernel methods.
By the end of this page, you will: • Derive the dual formulation of ridge regression from first principles • Understand the Representer Theorem and its profound implications • See exactly how the kernel trick naturally emerges from the dual • Grasp the geometric intuition behind the primal-dual relationship • Recognize when and why working in the dual is computationally advantageous
Let's begin by establishing our notation precisely and recalling the primal formulation of ridge regression.
Setup:
The primal ridge regression objective seeks weights $\mathbf{w} \in \mathcal{H}$ that minimize:
$$\mathcal{L}_{\text{primal}}(\mathbf{w}) = \|\mathbf{\Phi}\mathbf{w} - \mathbf{y}\|^2 + \lambda \|\mathbf{w}\|^2$$
where $\lambda > 0$ is the regularization parameter, $\mathbf{y} = (y_1, \ldots, y_n)^\top$ is the target vector, and $\mathbf{\Phi}$ is the $n \times D$ feature matrix whose $i$-th row is $\phi(\mathbf{x}_i)^\top$. The first term measures training error while the second penalizes model complexity.
Setting the gradient to zero, we derived the primal solution:
$$\mathbf{w}^* = (\mathbf{\Phi}^\top \mathbf{\Phi} + \lambda \mathbf{I}_D)^{-1} \mathbf{\Phi}^\top \mathbf{y}$$
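As a concrete illustration, here is a minimal NumPy sketch of this closed form; the feature matrix `Phi`, the data, and all parameter values are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n, D = 50, 10                      # n training points, D explicit features
Phi = rng.normal(size=(n, D))      # feature matrix with rows phi(x_i)^T
y = rng.normal(size=n)             # target vector
lam = 0.1                          # regularization parameter lambda

# Primal closed form: w* = (Phi^T Phi + lambda I_D)^(-1) Phi^T y
w_primal = np.linalg.solve(Phi.T @ Phi + lam * np.eye(D), Phi.T @ y)
print(w_primal.shape)              # (D,) -- the solution lives in feature space
```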
Its key characteristic is that it requires forming and inverting the $D \times D$ matrix $\mathbf{\Phi}^\top \mathbf{\Phi} + \lambda \mathbf{I}_D$. When $D$ is large (or infinite, as with RBF kernels), this primal form becomes computationally intractable or mathematically undefined. We need a different approach.
Consider a polynomial kernel of degree 5 on 1000-dimensional input. The explicit feature space has $\binom{1000+5}{5} \approx 8.4 \times 10^{12}$ dimensions. No computer can store or invert an $8.4$ trillion × $8.4$ trillion matrix. Yet kernel ridge regression handles this effortlessly—the dual formulation is why.
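The dimension count is easy to verify directly; a one-line sketch using Python's standard library:

```python
from math import comb

# Explicit feature-space dimension of a degree-5 polynomial kernel on 1000-D input
print(f"{comb(1000 + 5, 5):,}")   # over 8.4 trillion distinct monomials
```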
Before deriving the dual, we establish a foundational result that explains why the dual form is even possible. This result, known as the Representer Theorem, is one of the most important theorems in kernel methods and regularization theory.
Intuition: When we solve a regularized learning problem, we're balancing fit to the data against model complexity. The theorem tells us that no matter how rich our feature space is, the optimal solution always lies in a finite-dimensional subspace spanned by the training examples themselves.
For a broad class of regularized learning problems (including ridge regression), the optimal solution $\mathbf{w}^*$ can be written as a linear combination of the feature-mapped training points:
$$\mathbf{w}^* = \sum_{i=1}^{n} \alpha_i \phi(\mathbf{x}_i) = \mathbf{\Phi}^\top \boldsymbol{\alpha}$$
for some coefficients $\boldsymbol{\alpha} = (\alpha_1, \ldots, \alpha_n)^\top \in \mathbb{R}^n$.
Formal Statement:
Consider the optimization problem: $$\min_{\mathbf{w} \in \mathcal{H}} \left[ L(\mathbf{w}^\top \phi(\mathbf{x}_1), \ldots, \mathbf{w}^\top \phi(\mathbf{x}_n), y_1, \ldots, y_n) + \Omega(\|\mathbf{w}\|) \right]$$
where $L$ is an arbitrary loss function that depends on $\mathbf{w}$ only through the predicted values $\mathbf{w}^\top \phi(\mathbf{x}_i)$, and $\Omega$ is a strictly increasing function of the norm $\|\mathbf{w}\|$.
Then there exist $\alpha_1, \ldots, \alpha_n \in \mathbb{R}$ such that $\mathbf{w}^* = \sum_{i=1}^{n} \alpha_i \phi(\mathbf{x}_i)$.
Proof Sketch:
Decompose any $\mathbf{w}$ as: $$\mathbf{w} = \underbrace{\sum_{i=1}^{n} \alpha_i \phi(\mathbf{x}_i)}_{\mathbf{w}_\parallel \,\in\, \text{span}\{\phi(\mathbf{x}_i)\}} + \underbrace{\mathbf{w}_\perp}_{\perp \text{ to span}}$$
Critically, predictions depend only on $\mathbf{w}_\parallel$: $$\mathbf{w}^\top \phi(\mathbf{x}_j) = \mathbf{w}_\parallel^\top \phi(\mathbf{x}_j) + \mathbf{w}_\perp^\top \phi(\mathbf{x}_j) = \mathbf{w}_\parallel^\top \phi(\mathbf{x}_j)$$
since $\mathbf{w}_\perp \perp \phi(\mathbf{x}_j)$.
But $\|\mathbf{w}\|^2 = \|\mathbf{w}_\parallel\|^2 + \|\mathbf{w}_\perp\|^2 \geq \|\mathbf{w}_\parallel\|^2$.
Thus, $\mathbf{w}_\perp$ adds to the regularization penalty without affecting predictions. Since $\Omega$ is strictly increasing, the optimal solution must have $\mathbf{w}_\perp = \mathbf{0}$. $\square$
The Representer Theorem is profound: even if your feature space has billions (or infinitely many) dimensions, the optimal weights live in an n-dimensional subspace determined entirely by your n training examples. This is why we can switch from optimizing over w ∈ ℝ^D to optimizing over α ∈ ℝ^n—a dramatic reduction when n << D.
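A quick numerical sanity check of this claim, with made-up data and deliberately $n \ll D$: the primal solution should lie (to machine precision) in the span of the feature-mapped training points, i.e. the rows of `Phi`.

```python
import numpy as np

rng = np.random.default_rng(1)
n, D = 20, 200                          # deliberately n << D
Phi = rng.normal(size=(n, D))
y = rng.normal(size=n)
lam = 0.5

# Primal solution in the D-dimensional feature space
w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(D), Phi.T @ y)

# Representer Theorem check: w should be expressible as Phi^T alpha.
# Find the best alpha by least squares and measure the leftover component.
alpha, *_ = np.linalg.lstsq(Phi.T, w, rcond=None)
residual = np.linalg.norm(Phi.T @ alpha - w)
print(residual)   # effectively zero: w lies in span{phi(x_1), ..., phi(x_n)}
```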
Now we perform the key derivation: substituting the representer form $\mathbf{w} = \mathbf{\Phi}^\top \boldsymbol{\alpha}$ into the primal objective.
Step 1: Substitute into the prediction term
The predicted values on training data become: $$\mathbf{\Phi} \mathbf{w} = \mathbf{\Phi} \mathbf{\Phi}^\top \boldsymbol{\alpha} = \mathbf{K} \boldsymbol{\alpha}$$
where $\mathbf{K} = \mathbf{\Phi} \mathbf{\Phi}^\top$ is the kernel matrix (Gram matrix) with entries: $$K_{ij} = \phi(\mathbf{x}_i)^\top \phi(\mathbf{x}_j) = k(\mathbf{x}_i, \mathbf{x}_j)$$
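To see the identity $K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)$ in action, here is a small sketch using the homogeneous quadratic kernel $k(\mathbf{x}, \mathbf{z}) = (\mathbf{x}^\top\mathbf{z})^2$, chosen only because its explicit feature map is short; it is not one of the kernels discussed above.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(5, 2))             # 5 points in 2-D input space

# Explicit feature map for k(x, z) = (x.z)^2: phi(x) = (x1^2, sqrt(2) x1 x2, x2^2)
Phi = np.column_stack([X[:, 0]**2, np.sqrt(2) * X[:, 0] * X[:, 1], X[:, 1]**2])

K_explicit = Phi @ Phi.T                # Gram matrix via explicit features
K_kernel = (X @ X.T) ** 2               # same matrix via kernel evaluations only

print(np.allclose(K_explicit, K_kernel))  # True
```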
Step 2: Substitute into the regularization term
$$\|\mathbf{w}\|^2 = \mathbf{w}^\top \mathbf{w} = (\mathbf{\Phi}^\top \boldsymbol{\alpha})^\top (\mathbf{\Phi}^\top \boldsymbol{\alpha}) = \boldsymbol{\alpha}^\top \mathbf{\Phi} \mathbf{\Phi}^\top \boldsymbol{\alpha} = \boldsymbol{\alpha}^\top \mathbf{K} \boldsymbol{\alpha}$$
Step 3: Write the dual objective
$$\mathcal{L}_{\text{dual}}(\boldsymbol{\alpha}) = \|\mathbf{K}\boldsymbol{\alpha} - \mathbf{y}\|^2 + \lambda \boldsymbol{\alpha}^\top \mathbf{K} \boldsymbol{\alpha}$$
Expanding: $$= \boldsymbol{\alpha}^\top \mathbf{K}^2 \boldsymbol{\alpha} - 2\mathbf{y}^\top \mathbf{K} \boldsymbol{\alpha} + \mathbf{y}^\top \mathbf{y} + \lambda \boldsymbol{\alpha}^\top \mathbf{K} \boldsymbol{\alpha}$$
$$= \boldsymbol{\alpha}^\top (\mathbf{K}^2 + \lambda \mathbf{K}) \boldsymbol{\alpha} - 2\mathbf{y}^\top \mathbf{K} \boldsymbol{\alpha} + \text{const}$$
Step 4: Solve by setting gradient to zero
$$\nabla_\alpha \mathcal{L}_{\text{dual}} = 2(\mathbf{K}^2 + \lambda \mathbf{K})\boldsymbol{\alpha} - 2\mathbf{K}\mathbf{y} = \mathbf{0}$$
Factoring out $\mathbf{K}$ gives: $$\mathbf{K}\left[(\mathbf{K} + \lambda \mathbf{I})\boldsymbol{\alpha} - \mathbf{y}\right] = \mathbf{0}$$
Since $\mathbf{K}$ is positive semi-definite and $\lambda > 0$, the matrix $\mathbf{K} + \lambda \mathbf{I}$ is invertible (its eigenvalues are at least $\lambda$). Any $\boldsymbol{\alpha}$ satisfying $(\mathbf{K} + \lambda \mathbf{I})\boldsymbol{\alpha} = \mathbf{y}$ therefore solves the stationarity condition, and when $\mathbf{K}$ has full rank it is the unique minimizer:
$$\boldsymbol{\alpha}^* = (\mathbf{K} + \lambda \mathbf{I}_n)^{-1} \mathbf{y}$$
This is the closed-form solution for kernel ridge regression in dual form. Notice: • The matrix is n × n, not D × D • It depends only on kernel evaluations k(xᵢ, xⱼ) • No explicit features φ(x) appear anywhere
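A minimal sketch of the fitting step, assuming the kernel matrix `K` has already been computed; the linear kernel and all names below are purely illustrative.

```python
import numpy as np

def fit_kernel_ridge(K, y, lam):
    """Solve (K + lam I) alpha = y for the dual coefficients."""
    n = K.shape[0]
    return np.linalg.solve(K + lam * np.eye(n), y)

# Illustrative example with a linear kernel k(x, z) = x.z
rng = np.random.default_rng(3)
X, y = rng.normal(size=(30, 4)), rng.normal(size=30)
K = X @ X.T                      # n x n kernel matrix
alpha = fit_kernel_ridge(K, y, lam=0.1)
print(alpha.shape)               # (30,) -- one coefficient per training example
```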
The key insight: We've completely eliminated the feature mapping $\phi$ from both the optimization and the solution. All that matters is the kernel matrix $\mathbf{K}$, which can be computed directly using the kernel function $k(\mathbf{x}_i, \mathbf{x}_j)$.
This is the kernel trick in action for ridge regression.
A natural question arises: Are the primal and dual solutions truly equivalent? They're derived differently—do they produce the same predictions?
The answer is a resounding yes. Let's prove this rigorously.
From Dual to Primal:
Given $\boldsymbol{\alpha}^* = (\mathbf{K} + \lambda \mathbf{I})^{-1} \mathbf{y}$, the corresponding weight vector is: $$\mathbf{w}^* = \mathbf{\Phi}^\top \boldsymbol{\alpha}^* = \mathbf{\Phi}^\top (\mathbf{K} + \lambda \mathbf{I})^{-1} \mathbf{y}$$
From Primal to Dual:
The primal solution is: $$\mathbf{w}^* = (\mathbf{\Phi}^\top \mathbf{\Phi} + \lambda \mathbf{I}_D)^{-1} \mathbf{\Phi}^\top \mathbf{y}$$
Claim: These expressions yield identical predictions.
For matrices A, B, C of compatible dimensions with A and C invertible:
$$(\mathbf{A} + \mathbf{B}\mathbf{C}^{-1}\mathbf{B}^\top)^{-1}\mathbf{B}\mathbf{C}^{-1} = \mathbf{A}^{-1}\mathbf{B}(\mathbf{C} + \mathbf{B}^\top\mathbf{A}^{-1}\mathbf{B})^{-1}$$
This identity (a consequence of the matrix inversion lemma) lets us convert between primal and dual forms.
Proof of Equivalence:
We need to show that for any test point $\mathbf{x}_*$: $$\phi(\mathbf{x}_*)^\top \mathbf{w}^*_{\text{primal}} = \phi(\mathbf{x}_*)^\top \mathbf{w}^*_{\text{dual}}$$
Using the Woodbury identity with $\mathbf{A} = \lambda \mathbf{I}_D$ and $\mathbf{B} = \mathbf{\Phi}^\top$, $\mathbf{C} = \mathbf{I}_n$:
$$(\lambda \mathbf{I} + \mathbf{\Phi}^\top \mathbf{\Phi})^{-1} \mathbf{\Phi}^\top = \frac{1}{\lambda}\mathbf{\Phi}^\top (\mathbf{I} + \frac{1}{\lambda}\mathbf{\Phi}\mathbf{\Phi}^\top)^{-1}$$ $$= \mathbf{\Phi}^\top (\lambda \mathbf{I} + \mathbf{\Phi}\mathbf{\Phi}^\top)^{-1} = \mathbf{\Phi}^\top (\lambda \mathbf{I} + \mathbf{K})^{-1}$$
Therefore: $$\mathbf{w}^*_{\text{primal}} = (\mathbf{\Phi}^\top \mathbf{\Phi} + \lambda \mathbf{I})^{-1} \mathbf{\Phi}^\top \mathbf{y} = \mathbf{\Phi}^\top (\mathbf{K} + \lambda \mathbf{I})^{-1} \mathbf{y} = \mathbf{w}^*_{\text{dual}}$$
The solutions are identical. $\square$
While mathematically equivalent, the primal and dual may differ numerically due to finite precision arithmetic. The dual is numerically preferable when n << D (fewer computations, smaller matrices), while the primal is preferable when D << n. The crossover point depends on implementation details and problem conditioning.
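The equivalence, and the numerical point above, can be checked directly on a small random problem with an explicit finite-dimensional feature matrix standing in for $\mathbf{\Phi}$:

```python
import numpy as np

rng = np.random.default_rng(5)
n, D = 25, 40
Phi = rng.normal(size=(n, D))           # training features phi(x_i)^T
y = rng.normal(size=n)
phi_test = rng.normal(size=D)           # phi(x_*) for one test point
lam = 0.2

# Primal: w* = (Phi^T Phi + lam I_D)^(-1) Phi^T y
w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(D), Phi.T @ y)

# Dual: alpha* = (K + lam I_n)^(-1) y, with K = Phi Phi^T
alpha = np.linalg.solve(Phi @ Phi.T + lam * np.eye(n), y)

pred_primal = phi_test @ w
pred_dual = (Phi @ phi_test) @ alpha    # k_* entries are phi(x_i)^T phi(x_*)
print(np.isclose(pred_primal, pred_dual))   # True (up to floating point)
```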
The duality between primal and dual formulations has a beautiful geometric interpretation that deepens our understanding of what's happening.
The Primal View:
In the primal, we're finding weights $\mathbf{w} \in \mathcal{H}$ that define a linear function $f(\mathbf{x}) = \mathbf{w}^\top \phi(\mathbf{x})$. The space of all such functions is the full feature space $\mathcal{H}$, which may have dimension $D \gg n$ or even be infinite-dimensional.
The regularization $\|\mathbf{w}\|^2$ penalizes directions in $\mathcal{H}$ that are "expensive" according to the norm.
The Dual View:
In the dual, we're expressing the same function as: $$f(\mathbf{x}) = \sum_{i=1}^n \alpha_i k(\mathbf{x}_i, \mathbf{x}) = \sum_{i=1}^n \alpha_i \phi(\mathbf{x}_i)^\top \phi(\mathbf{x})$$
This represents $f$ as a weighted combination of "basis functions" centered at each training point. The coefficient $\alpha_i$ determines how much training example $i$ contributes to the prediction.
The Subspace Relationship:
The Representer Theorem tells us that $\mathbf{w}^* \in \text{span}\{\phi(\mathbf{x}_1), \ldots, \phi(\mathbf{x}_n)\}$. This subspace has dimension at most $n$, regardless of $D$.
Geometrically, imagine the feature space $\mathcal{H}$ as a vast arena. The training examples $\phi(\mathbf{x}_1), \ldots, \phi(\mathbf{x}_n)$ define a small "tent" within this arena—an $n$-dimensional subspace. The Representer Theorem guarantees that the optimal $\mathbf{w}^*$ lies within this tent.
The dual optimization directly parameterizes vectors in this tent via the coordinates $\boldsymbol{\alpha}$. This is why it's computationally efficient: we're searching an $n$-dimensional space instead of a $D$-dimensional one.
Primal and dual are just different coordinate systems for the same solution. The primal uses the standard basis of the feature space; the dual uses the training examples as basis vectors. When the training examples span a much smaller space than the full feature space, the dual coordinates are more efficient.
We're now in a position to see precisely why the kernel trick is so powerful for ridge regression.
Observation 1: The dual solution depends only on K
$$\boldsymbol{\alpha}^* = (\mathbf{K} + \lambda \mathbf{I})^{-1} \mathbf{y}$$
To compute $\boldsymbol{\alpha}^*$, we never need explicit features $\phi(\mathbf{x}_i)$. We only need the $n \times n$ kernel matrix $\mathbf{K}$, with entries $K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)$, and the target vector $\mathbf{y}$.
Observation 2: Predictions depend only on kernel evaluations
For a test point $\mathbf{x}_*$: $$f(\mathbf{x}_*) = \mathbf{w}^{*\top} \phi(\mathbf{x}_*) = \sum_{i=1}^n \alpha_i^* \phi(\mathbf{x}_i)^\top \phi(\mathbf{x}_*) = \sum_{i=1}^n \alpha_i^* k(\mathbf{x}_i, \mathbf{x}_*)$$
$$f(\mathbf{x}_*) = \mathbf{k}_*^\top \boldsymbol{\alpha}^* = \mathbf{k}_*^\top (\mathbf{K} + \lambda \mathbf{I})^{-1} \mathbf{y}$$
where $\mathbf{k}_* = (k(\mathbf{x}_1, \mathbf{x}_*), \ldots, k(\mathbf{x}_n, \mathbf{x}_*))^\top$ is the vector of kernel evaluations between the test point and all training points.
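A sketch of the prediction step; `krr_predict` and the surrounding names are illustrative, not from any particular library.

```python
import numpy as np

def krr_predict(kernel, X_train, alpha, X_test):
    """Predict f(x_*) = k_*^T alpha for each test point, using only kernel calls."""
    K_test = np.array([[kernel(x_i, x_star) for x_i in X_train] for x_star in X_test])
    return K_test @ alpha

# Illustrative usage with a linear kernel (any valid kernel works the same way)
linear_kernel = lambda x, z: float(x @ z)
rng = np.random.default_rng(6)
X_train, y = rng.normal(size=(20, 3)), rng.normal(size=20)
K = np.array([[linear_kernel(a, b) for b in X_train] for a in X_train])
alpha = np.linalg.solve(K + 0.1 * np.eye(20), y)
print(krr_predict(linear_kernel, X_train, alpha, rng.normal(size=(4, 3))))
```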
Observation 3: The entire algorithm is "kernelized"
Every step of kernel ridge regression can be performed with only kernel evaluations: forming the kernel matrix $\mathbf{K}$, solving for the coefficients $\boldsymbol{\alpha}^*$, and predicting on new points via $\mathbf{k}_*$.
Observation 4: This works for infinite-dimensional feature spaces
For the RBF kernel: $$k(\mathbf{x}, \mathbf{x}') = \exp\left(-\frac{\|\mathbf{x} - \mathbf{x}'\|^2}{2\sigma^2}\right)$$
The corresponding feature space $\mathcal{H}$ is infinite-dimensional. Yet we can fit the model and compute predictions exactly, using only the $n \times n$ kernel matrix and kernel evaluations at test points.
We're effectively working in infinite dimensions without ever constructing an infinite-dimensional vector!
The dual formulation transforms a potentially impossible optimization in D-dimensional feature space into a tractable n × n linear system. When D = ∞ (as with RBF kernels), this isn't just convenient—it's the only way to proceed. The kernel trick makes the impossible possible.
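Putting the pieces together, here is a minimal end-to-end sketch of kernel ridge regression with an RBF kernel on synthetic 1-D data; all names and hyperparameter values are illustrative.

```python
import numpy as np

def rbf_kernel_matrix(A, B, sigma=1.0):
    """Pairwise RBF kernel evaluations k(a, b) = exp(-||a - b||^2 / (2 sigma^2))."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dists / (2 * sigma ** 2))

# Fit a 1-D nonlinear function with kernel ridge regression (synthetic data)
rng = np.random.default_rng(7)
X = rng.uniform(-3, 3, size=(40, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=40)
lam, sigma = 0.1, 0.7

K = rbf_kernel_matrix(X, X, sigma)                      # n x n, despite D = infinity
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)    # dual coefficients

X_test = np.linspace(-3, 3, 5).reshape(-1, 1)
f_test = rbf_kernel_matrix(X_test, X, sigma) @ alpha    # k_*^T alpha per test point
print(np.round(f_test, 2))                              # approximates sin on [-3, 3]
```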
For completeness, let's derive the dual using Lagrangian duality—the classical optimization approach. This provides additional insight and connects to the broader theory of convex optimization.
Reformulate as Constrained Optimization:
Rewrite ridge regression with an auxiliary variable: $$\min_{\mathbf{w}, \boldsymbol{\xi}} \|\boldsymbol{\xi}\|^2 + \lambda \|\mathbf{w}\|^2 \quad \text{s.t.} \quad \boldsymbol{\xi} = \mathbf{\Phi}\mathbf{w} - \mathbf{y}$$
Form the Lagrangian: $$\mathcal{L}(\mathbf{w}, \boldsymbol{\xi}, \boldsymbol{\beta}) = \|\boldsymbol{\xi}\|^2 + \lambda \|\mathbf{w}\|^2 + \boldsymbol{\beta}^\top(\mathbf{\Phi}\mathbf{w} - \mathbf{y} - \boldsymbol{\xi})$$
where $\boldsymbol{\beta} \in \mathbb{R}^n$ are Lagrange multipliers.
Minimize over primal variables:
$$\frac{\partial \mathcal{L}}{\partial \boldsymbol{\xi}} = 2\boldsymbol{\xi} - \boldsymbol{\beta} = \mathbf{0} \implies \boldsymbol{\xi}^* = \frac{1}{2}\boldsymbol{\beta}$$
$$\frac{\partial \mathcal{L}}{\partial \mathbf{w}} = 2\lambda \mathbf{w} + \mathbf{\Phi}^\top \boldsymbol{\beta} = \mathbf{0} \implies \mathbf{w}^* = -\frac{1}{2\lambda}\mathbf{\Phi}^\top \boldsymbol{\beta}$$
Substitute back to get dual:
$$g(\boldsymbol{\beta}) = \frac{1}{4}\|\boldsymbol{\beta}\|^2 + \frac{1}{4\lambda}\boldsymbol{\beta}^\top \mathbf{K} \boldsymbol{\beta} - \frac{1}{2\lambda}\boldsymbol{\beta}^\top \mathbf{K} \boldsymbol{\beta} - \boldsymbol{\beta}^\top\mathbf{y} - \frac{1}{2}\|\boldsymbol{\beta}\|^2$$
Simplifying: $$g(\boldsymbol{\beta}) = -\frac{1}{4}\|\boldsymbol{\beta}\|^2 - \frac{1}{4\lambda}\boldsymbol{\beta}^\top \mathbf{K} \boldsymbol{\beta} - \boldsymbol{\beta}^\top\mathbf{y}$$
Maximizing this (or equivalently, setting its gradient to zero): $$-\frac{1}{2}\boldsymbol{\beta} - \frac{1}{2\lambda}\mathbf{K}\boldsymbol{\beta} - \mathbf{y} = \mathbf{0}$$ $$\left(\mathbf{I} + \frac{1}{\lambda}\mathbf{K}\right)\boldsymbol{\beta} = -2\mathbf{y}$$ $$(\mathbf{K} + \lambda\mathbf{I})\boldsymbol{\beta} = -2\lambda\mathbf{y}$$
With $\boldsymbol{\alpha} = -\frac{1}{2\lambda}\boldsymbol{\beta}$, we recover: $$\boldsymbol{\alpha} = (\mathbf{K} + \lambda\mathbf{I})^{-1}\mathbf{y}$$
The same solution, derived through a different lens!
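As a final sanity check on the Lagrangian route (with invented data), the multipliers and the dual coefficients should be related by $\boldsymbol{\beta} = -2\lambda\boldsymbol{\alpha}$, so $\boldsymbol{\beta}$ satisfies $(\mathbf{K} + \lambda\mathbf{I})\boldsymbol{\beta} = -2\lambda\mathbf{y}$:

```python
import numpy as np

rng = np.random.default_rng(8)
n = 15
A = rng.normal(size=(n, 5))
K = A @ A.T                                        # a valid (PSD) kernel matrix
y = rng.normal(size=n)
lam = 0.4

alpha = np.linalg.solve(K + lam * np.eye(n), y)    # dual coefficients
beta = -2 * lam * alpha                            # implied Lagrange multipliers
print(np.allclose((K + lam * np.eye(n)) @ beta, -2 * lam * y))   # True
```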
Since ridge regression is a convex problem whose only constraint (in the reformulation above) is affine, strong duality holds: the optimal values of the primal and dual are equal. This guarantees that the dual solution truly optimizes the original problem, not just a lower bound.
We've established the mathematical foundation for kernel ridge regression by deriving and analyzing the dual formulation. To consolidate: the Representer Theorem guarantees the solution lies in the span of the feature-mapped training points; substituting $\mathbf{w} = \mathbf{\Phi}^\top\boldsymbol{\alpha}$ yields the $n \times n$ dual system $(\mathbf{K} + \lambda\mathbf{I})\boldsymbol{\alpha} = \mathbf{y}$; the primal and dual solutions make identical predictions; and every step requires only kernel evaluations.
What's Next:
In the next page, we'll examine the kernel matrix in depth. This $n \times n$ matrix $\mathbf{K}$ is the central object in kernel ridge regression, and we'll explore its properties in detail.
You now understand the dual formulation of ridge regression—the key insight that enables kernel methods. This mathematical machinery transforms intractable high-dimensional optimization into tractable n × n linear algebra, opening the door to powerful nonlinear regression in arbitrary feature spaces.