Throughout our journey in regression, we have relied heavily on linear models—methods that express predictions as weighted combinations of input features. Linear regression, ridge regression, and lasso all share this fundamental assumption: the relationship between inputs and outputs can be captured by a linear function.
But here lies a profound limitation: the real world is rarely linear.
Consider these everyday phenomena:
When we attempt to model these nonlinear relationships with linear methods, we are fundamentally limited—not by our algorithmic prowess, but by our representation of the problem.
By the end of this page, you will understand why mapping data to higher-dimensional feature spaces enables linear methods to capture nonlinear patterns. You'll see how this innocent-seeming transformation—feature space mapping—forms the conceptual foundation for kernel methods, one of the most powerful ideas in machine learning.
Let us formalize what we mean by 'linear' and understand precisely where this constraint binds us.
The Linear Model Structure
A linear model in its most general form predicts:
$$\hat{y} = f(\mathbf{x}) = \mathbf{w}^\top \mathbf{x} + b = \sum_{j=1}^{d} w_j x_j + b$$
where $\mathbf{x} \in \mathbb{R}^d$ is the input feature vector, $\mathbf{w} \in \mathbb{R}^d$ is the weight vector, and $b$ is the bias (intercept) term.
This model is called 'linear' because $f(\mathbf{x})$ is a linear function of $\mathbf{x}$—technically, an affine function since we include the bias. The crucial property is that the prediction changes proportionally and additively with each input feature.
An important distinction: 'linear model' typically means linear in the parameters (weights). A model like $y = w_1 x^2 + w_2 x + b$ is linear in its parameters ($w_1, w_2, b$) even though it's quadratic in $x$. This distinction becomes crucial when we discuss feature mappings.
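To make the distinction concrete, here is a minimal sketch (the data and coefficients are purely illustrative) showing that fitting $y = w_1 x^2 + w_2 x + b$ is still an ordinary least-squares problem: we simply treat $x^2$ as one more column of the design matrix.

```python
import numpy as np

# Illustrative data from a quadratic relationship (assumed coefficients)
rng = np.random.default_rng(0)
x = np.linspace(-2, 2, 50)
y = 1.5 * x**2 - 0.7 * x + 2.0 + 0.1 * rng.standard_normal(50)

# The model y = w1*x^2 + w2*x + b is quadratic in x but linear in (w1, w2, b),
# so ordinary least squares applies once x^2 is treated as a column of its own.
A = np.column_stack([x**2, x, np.ones_like(x)])
w1, w2, b = np.linalg.lstsq(A, y, rcond=None)[0]
print(f"Recovered: w1={w1:.2f}, w2={w2:.2f}, b={b:.2f}")  # close to 1.5, -0.7, 2.0
```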
Geometric Interpretation
Geometrically, a linear model in $d$ dimensions defines a hyperplane: in regression, the prediction surface is a flat hyperplane over the input space; in classification, the decision boundary is a hyperplane separating the classes.
The fundamental limitation is clear: a hyperplane can only capture relationships that are themselves planar. When the true relationship curves, bends, or twists through space, no hyperplane—no matter how cleverly positioned—can adequately describe it.
```python
import numpy as np
from numpy.linalg import lstsq

# Generate nonlinear data: y = x^2 + noise
np.random.seed(42)
X = np.linspace(-3, 3, 100).reshape(-1, 1)
y_true = X.ravel()**2
y = y_true + np.random.normal(0, 0.5, 100)

# Fit a linear model (the best possible straight line)
X_design = np.column_stack([np.ones(100), X])
w_linear = lstsq(X_design, y, rcond=None)[0]
y_pred_linear = X_design @ w_linear

# Calculate residual error
mse_linear = np.mean((y - y_pred_linear)**2)
print(f"Linear model MSE: {mse_linear:.4f}")

# The linear model cannot capture the parabolic structure: the best-fit
# line is nearly flat (y ≈ 3) and misses the curvature entirely,
# leaving a large MSE (above 7 on this data).
```

The XOR Problem: A Classic Example
The limitations of linearity are starkly illustrated by the XOR problem—a pattern that is trivially simple for nonlinear methods but fundamentally impossible for any linear classifier: the points $(0,0)$ and $(1,1)$ belong to class 0, while $(0,1)$ and $(1,0)$ belong to class 1.
No single line can separate these four points by class. This is because the XOR function is not linearly separable—there exists no hyperplane that places all class-0 points on one side and all class-1 points on the other.
This example, though simple, represents a fundamental barrier that appears throughout machine learning whenever the underlying relationship involves interactions, thresholds, or complex dependencies between features.
Here is the key insight that unlocks nonlinear modeling with linear methods:
If the data is not linearly separable in its current representation, perhaps it becomes linearly separable in a different representation.
This deceptively simple idea—transforming the data before applying a linear method—is the foundation of feature space mapping and, ultimately, kernel methods.
The Core Strategy
Instead of fitting a linear model to the original features $\mathbf{x}$, we:

1. Choose a feature map $\phi$ and transform each input into $\phi(\mathbf{x})$.
2. Fit an ordinary linear model in the transformed space.
3. Obtain a predictor $f(\mathbf{x}) = \mathbf{w}^\top \phi(\mathbf{x})$ that is linear in $\phi(\mathbf{x})$ but can be highly nonlinear in $\mathbf{x}$.
The function $\phi$ does not change the underlying problem—the same data points, the same labels. What changes is our representation of the problem. A curve in $\mathbb{R}^2$ might unfold into a straight line in $\mathbb{R}^{10}$. The complexity hasn't disappeared; it has been absorbed into the transformation.
Solving XOR with Feature Transformation
Let us return to the XOR problem. With original features $\mathbf{x} = (x_1, x_2)$, no linear separator exists. But consider this feature map:
$$\phi(x_1, x_2) = (x_1, x_2, x_1 \cdot x_2)$$
We have added a third dimension: the product of the two original features. Now let's examine the transformed points:
| Original $(x_1, x_2)$ | Class | Transformed $(x_1, x_2, x_1 x_2)$ |
|---|---|---|
| $(0, 0)$ | 0 | $(0, 0, 0)$ |
| $(1, 1)$ | 0 | $(1, 1, 1)$ |
| $(0, 1)$ | 1 | $(0, 1, 0)$ |
| $(1, 0)$ | 1 | $(1, 0, 0)$ |
In the 3D transformed space, the classes pull apart: the class-1 points $(0,1,0)$ and $(1,0,0)$ satisfy $x_1 + x_2 - 2x_1x_2 = 1$, while the class-0 points $(0,0,0)$ and $(1,1,1)$ satisfy $x_1 + x_2 - 2x_1x_2 = 0$. The hyperplane $x_1 + x_2 - 2x_1x_2 = \tfrac{1}{2}$ now separates the classes perfectly!
The XOR problem, unsolvable by any linear method in the original space, becomes trivially solvable after the right transformation.
```python
import numpy as np

# XOR data
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])  # XOR labels

# Original space: no line w1*x1 + w2*x2 + b = 0 separates the classes
print("Original space: NOT linearly separable")

# Feature transformation: add the interaction term
def phi(X):
    """Map (x1, x2) -> (x1, x2, x1*x2)"""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([x1, x2, x1 * x2])

X_transformed = phi(X)
print("Transformed features:")
print(X_transformed)

# In the transformed space, the hyperplane x1 + x2 - 2*x1*x2 = 0.5
# separates the classes: class-1 points give x1 + x2 - 2*x1*x2 = 1,
# class-0 points give 0.
w = np.array([1, 1, -2])
b = -0.5
predictions = (X_transformed @ w + b) > 0
print(f"Predictions: {predictions.astype(int)}")
print(f"Labels:      {y}")
print(f"Accuracy: {np.mean(predictions == y):.1%}")
```

The most natural and widely used feature transformation is polynomial expansion—constructing new features as polynomial combinations of the original features.
Univariate Polynomial Features
For a single input $x$, a polynomial feature map of degree $p$ is:
$$\phi_p(x) = (1, x, x^2, x^3, \ldots, x^p)$$
A linear model in this transformed space:
$$f(x) = w_0 + w_1 x + w_2 x^2 + \cdots + w_p x^p$$
is a polynomial of degree $p$ in $x$. By choosing $p$ large enough, we can approximate any continuous function on a bounded interval arbitrarily well (by the Weierstrass approximation theorem).
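To see this in action, here is a small sketch (reusing the same kind of parabolic data as the earlier linear-fit example) that fits a linear model on the degree-2 features $\phi_2(x) = (1, x, x^2)$ and recovers the curvature the straight line missed:

```python
import numpy as np

# Parabolic data, as in the earlier linear-fit example: y = x^2 + noise
np.random.seed(42)
x = np.linspace(-3, 3, 100)
y = x**2 + np.random.normal(0, 0.5, 100)

# Degree-2 polynomial feature map: phi_2(x) = (1, x, x^2)
Phi = np.column_stack([np.ones_like(x), x, x**2])

# Ordinary least squares in the transformed space
w = np.linalg.lstsq(Phi, y, rcond=None)[0]
y_pred = Phi @ w
print(f"Polynomial (degree 2) MSE: {np.mean((y - y_pred)**2):.4f}")
# The error drops to roughly the noise variance (~0.25), far below the
# error of the best straight line on the same data.
```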
Multivariate Polynomial Features
For $d$-dimensional inputs, polynomial features include all monomials up to degree $p$:
$$\phi_p(\mathbf{x}) = (1, x_1, x_2, \ldots, x_d, x_1^2, x_1 x_2, x_1 x_3, \ldots, x_d^p)$$
This includes the constant term, all linear terms $x_j$, all squares $x_j^2$ and pairwise interactions $x_i x_j$, and so on up to every monomial of total degree $p$.
The number of polynomial features grows explosively. For $d$ input features and degree $p$, the number of features is $\binom{d + p}{p}$. With $d = 100$ features and $p = 3$, this exceeds 176,000 features. With $p = 5$, it exceeds 96 million! This explosion is where the kernel trick will save us.
Dimension of Polynomial Feature Spaces
The dimensionality $D$ of the polynomial feature space is given by:
$$D = \binom{d + p}{p} = \frac{(d + p)!}{d! \cdot p!}$$
Some examples:
| Input dimension $d$ | Degree $p$ | Feature dimension $D$ |
|---|---|---|
| 2 | 2 | 6 |
| 2 | 3 | 10 |
| 10 | 2 | 66 |
| 10 | 3 | 286 |
| 100 | 2 | 5,151 |
| 100 | 3 | 176,851 |
| 100 | 5 | 96,560,646 |
This combinatorial growth makes explicit feature computation infeasible for high degrees or high-dimensional inputs.
```python
import numpy as np
from itertools import combinations_with_replacement
from math import comb

def polynomial_features(X, degree):
    """
    Compute polynomial features up to the given degree.

    For X with shape (n_samples, n_features), returns all
    polynomial combinations up to 'degree'.
    """
    n_samples, n_features = X.shape

    # Calculate output dimension
    n_output_features = comb(n_features + degree, degree)
    print(f"Input: {n_features} features -> Output: {n_output_features} features")

    features = []

    # Degree 0: constant term
    features.append(np.ones(n_samples))

    # Degrees 1 through p
    for d in range(1, degree + 1):
        # All combinations of d features (with replacement)
        for combo in combinations_with_replacement(range(n_features), d):
            # Multiply the selected features
            feature = np.ones(n_samples)
            for idx in combo:
                feature *= X[:, idx]
            features.append(feature)

    return np.column_stack(features)

# Example: 2D input, degree 2
X = np.array([[1, 2], [3, 4], [5, 6]])
X_poly = polynomial_features(X, degree=2)
print(f"Shape: {X_poly.shape}")
print("Features: [1, x1, x2, x1^2, x1*x2, x2^2]")
print(X_poly)

# For (1, 2): [1, 1, 2, 1, 2, 4]
# For (3, 4): [1, 3, 4, 9, 12, 16]
# For (5, 6): [1, 5, 6, 25, 30, 36]
```

Why Polynomial Features Work
Polynomial features work because they capture interactions between features (through cross terms such as $x_i x_j$), curvature (through squares and higher powers), and increasingly fine detail as the degree grows.
The Taylor series expansion tells us that any smooth function can be locally approximated by a polynomial. With enough polynomial features and appropriate regularization, we can model arbitrarily complex relationships.
While polynomial features are intuitive, they are just one family among infinitely many possible feature transformations. The concept of feature mapping is far more general.
The Abstract Perspective
A feature map (also called a basis function expansion) is any function:
$$\phi: \mathcal{X} \rightarrow \mathcal{H}$$
that transforms inputs from the original space $\mathcal{X}$ (often $\mathbb{R}^d$) into a feature space $\mathcal{H}$ (often $\mathbb{R}^D$ with $D > d$, or even infinite-dimensional).
The only requirement is that we can subsequently fit a linear model in $\mathcal{H}$:
$$f(\mathbf{x}) = \mathbf{w}^\top \phi(\mathbf{x}) = \sum_{j=1}^{D} w_j \phi_j(\mathbf{x})$$
Different choices of $\phi$ encode different inductive biases—assumptions about what kinds of functions are likely to fit the data well.
Radial Basis Function Features
RBF features are particularly important because they connect to the Gaussian kernel we will study later:
$$\phi_\text{RBF}(\mathbf{x}) = \left( \exp\left(-\gamma \|\mathbf{x} - \mathbf{c}_1\|^2\right), \ldots, \exp\left(-\gamma \|\mathbf{x} - \mathbf{c}_m\|^2\right) \right)$$
Each feature measures the similarity of $\mathbf{x}$ to a center point $\mathbf{c}_j$. Points near $\mathbf{c}_j$ activate that feature strongly; distant points activate it weakly.
The parameter $\gamma$ controls the bandwidth: a large $\gamma$ gives narrow bumps, so each feature responds only to points very close to its center; a small $\gamma$ gives wide bumps that respond to a broad region of the input space.
With enough centers (potentially one per training point), RBF features can approximate any function.
```python
import numpy as np

def rbf_features(X, centers, gamma):
    """
    Compute RBF features.

    Parameters:
    - X: (n_samples, n_features) input data
    - centers: (n_centers, n_features) RBF centers
    - gamma: bandwidth parameter

    Returns:
    - (n_samples, n_centers) RBF feature matrix
    """
    # Compute squared distances between each x and each center
    # using the identity ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a·b
    X_sq = np.sum(X**2, axis=1, keepdims=True)        # (n, 1)
    C_sq = np.sum(centers**2, axis=1, keepdims=True)  # (m, 1)
    XC = X @ centers.T                                # (n, m)
    sq_dist = X_sq + C_sq.T - 2 * XC                  # (n, m)
    return np.exp(-gamma * sq_dist)

# Example: nonlinear 1D regression
np.random.seed(42)
X = np.linspace(0, 4*np.pi, 50).reshape(-1, 1)
y = np.sin(X.ravel()) + 0.1 * np.random.randn(50)

# Create RBF features using the training points as centers
centers = X.copy()  # each point is a center
gamma = 1.0
Phi = rbf_features(X, centers, gamma)

print(f"RBF feature matrix shape: {Phi.shape}")
print(f"Phi[0, 0] (x near center 0): {Phi[0, 0]:.4f}")
print(f"Phi[0, 25] (x far from center 25): {Phi[0, 25]:.6f}")

# Fit linear regression in RBF space (with regularization)
lambda_reg = 0.001
w = np.linalg.solve(Phi.T @ Phi + lambda_reg * np.eye(50), Phi.T @ y)

# Predictions
y_pred = Phi @ w
mse = np.mean((y - y_pred)**2)
print(f"MSE: {mse:.6f}")  # near-perfect fit on the training data
```

To truly understand why feature mapping works, we must develop geometric intuition for what happens when we lift data into higher dimensions.
The Lifting Metaphor
Imagine you have points scattered on a sheet of paper (2D), and they cannot be separated by a straight line. Now imagine crumpling that paper into a ball (embedding it in 3D). Suddenly, a plane that cuts through the ball might separate the points that were inseparable on the flat sheet.
This is precisely what feature mapping does: it lifts data from a low-dimensional space where linear separation is impossible into a higher-dimensional space where it becomes possible.
Mathematical Intuition
Consider points on a circle: $\mathbf{x} = (\cos\theta, \sin\theta)$. In 2D, these points lie on a curve. But with the feature map:
$$\phi(x_1, x_2) = (x_1^2, \sqrt{2} x_1 x_2, x_2^2)$$
the transformed points satisfy:
$$\|\phi(\mathbf{x})\|^2 = x_1^4 + 2x_1^2 x_2^2 + x_2^4 = (x_1^2 + x_2^2)^2 = 1$$
All points on the unit circle in 2D map to unit-norm points in feature space: they land on the unit sphere in $\mathbb{R}^3$, as the quick check below confirms. The curved structure in the original space becomes a different (possibly simpler) structure in feature space.
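Here is a small numerical sketch (using the degree-2 feature map defined above) of that claim:

```python
import numpy as np

def phi(x1, x2):
    """Degree-2 feature map: (x1, x2) -> (x1^2, sqrt(2)*x1*x2, x2^2)."""
    return np.stack([x1**2, np.sqrt(2) * x1 * x2, x2**2], axis=-1)

# Points on the unit circle
theta = np.linspace(0, 2 * np.pi, 8, endpoint=False)
x1, x2 = np.cos(theta), np.sin(theta)

# Every transformed point has norm 1: the circle lands on the unit sphere in R^3
norms = np.linalg.norm(phi(x1, x2), axis=-1)
print(np.allclose(norms, 1.0))  # True
```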
Cover's theorem (1965) formalizes this intuition: a complex pattern-classification problem cast in a high-dimensional space nonlinearly is more likely to be linearly separable than in a low-dimensional space. The probability of linear separability increases with dimensionality—but we must balance this against the danger of overfitting.
The Inner Product Perspective
A crucial insight emerges when we consider inner products in feature space. The inner product between two transformed points is:
$$\langle \phi(\mathbf{x}), \phi(\mathbf{x}') \rangle = \sum_{j=1}^{D} \phi_j(\mathbf{x}) \phi_j(\mathbf{x}')$$
This inner product measures similarity in feature space. Points that are similar (in the transformed representation) have large inner products; dissimilar points have small ones.
Here is the key observation that leads to kernels:
For certain feature maps, the inner product $\langle \phi(\mathbf{x}), \phi(\mathbf{x}') \rangle$ can be computed directly from $\mathbf{x}$ and $\mathbf{x}'$, without ever computing $\phi(\mathbf{x})$ or $\phi(\mathbf{x}')$ explicitly.
This observation is the foundation of the kernel trick, which we will develop in the next page.
```python
import numpy as np

# Example: degree-2 polynomial feature map in 2D
def phi_poly2(x):
    """
    Feature map: (x1, x2) -> (x1^2, sqrt(2)*x1*x2, x2^2)
    """
    x1, x2 = x[0], x[1]
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

# Two points
x = np.array([1.0, 2.0])
x_prime = np.array([3.0, 4.0])

# Method 1: compute the features, then the inner product
phi_x = phi_poly2(x)
phi_x_prime = phi_poly2(x_prime)
inner_prod_explicit = np.dot(phi_x, phi_x_prime)
print("Explicit computation:")
print(f"  φ(x)  = {phi_x}")
print(f"  φ(x') = {phi_x_prime}")
print(f"  <φ(x), φ(x')> = {inner_prod_explicit}")

# Method 2: direct computation (a preview of the kernel trick)
# For this specific φ: <φ(x), φ(x')> = (x · x')^2
inner_prod_direct = np.dot(x, x_prime)**2
print("Direct computation:")
print(f"  x · x' = {np.dot(x, x_prime)}")
print(f"  (x · x')^2 = {inner_prod_direct}")

# They are equal!
print(f"Equal? {np.allclose(inner_prod_explicit, inner_prod_direct)}")

# This is remarkable: we get the same result without
# ever computing the 3D feature vectors!
```

We have established that feature mapping enables linear methods to learn nonlinear patterns. However, this power comes with a severe computational burden.
The Explicit Feature Problem
Consider ridge regression with polynomial features of degree $p$. The algorithm requires:

1. Computing and storing the $n \times D$ feature matrix $\Phi$.
2. Forming the $D \times D$ matrix $\Phi^\top \Phi$, which costs $O(nD^2)$ time and $O(D^2)$ memory.
3. Solving a $D \times D$ linear system, which costs $O(D^3)$ time.
When $D = \binom{d+p}{p}$ grows exponentially in $p$, even storing the feature matrix becomes infeasible, let alone computing with it.
A Concrete Example
With $d = 100$ input features and degree $p = 5$, the feature dimension is $D \approx 96.5$ million. Storing the feature matrix for even $n = 1{,}000$ samples would take roughly 770 GB (at 8 bytes per entry), and the $D \times D$ matrix $\Phi^\top \Phi$ would occupy tens of petabytes.
This is utterly impractical. Yet the underlying model—a degree-5 polynomial—might be exactly what we need.
We have a powerful idea (feature expansion) that is computationally infeasible for the cases where we need it most (high-degree polynomials, high-dimensional inputs). We need a way to work in high-dimensional feature spaces without ever constructing the features explicitly.
Infinite-Dimensional Feature Spaces
Some of the most powerful feature maps are not just high-dimensional but infinite-dimensional. The Gaussian RBF feature map, for example, maps each input to an infinite-dimensional vector:
$$\phi_\text{Gaussian}(\mathbf{x}) \in \mathbb{R}^\infty$$
We cannot possibly compute or store infinite-dimensional vectors. Yet, as we will see, we can work with them through kernels.
The Dual Representation Hint
Recall from our study of ridge regression that the solution can be written in dual form:
$$\mathbf{w} = \Phi^\top \boldsymbol{\alpha} = \sum_{i=1}^{n} \alpha_i \phi(\mathbf{x}_i)$$
for some coefficients $\boldsymbol{\alpha} \in \mathbb{R}^n$. The weight vector is a linear combination of the training points in feature space.
This means that predictions become:
$$f(\mathbf{x}) = \mathbf{w}^\top \phi(\mathbf{x}) = \sum_{i=1}^{n} \alpha_i \langle \phi(\mathbf{x}_i), \phi(\mathbf{x}) \rangle$$
The prediction depends only on inner products between the test point and training points in feature space—not on the features themselves!
This observation is the gateway to the kernel trick: if we can compute these inner products efficiently, we never need the features explicitly.
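As a sketch (with synthetic data and the degree-2 feature map from earlier), we can verify numerically that the primal ridge solution, which manipulates the $D$-dimensional weight vector, and the dual solution, which only ever touches the $n \times n$ matrix of feature-space inner products, produce identical predictions:

```python
import numpy as np

rng = np.random.default_rng(0)
n, lam = 20, 0.1
X = rng.standard_normal((n, 2))
y = rng.standard_normal(n)

# Explicit degree-2 feature map: (x1^2, sqrt(2)*x1*x2, x2^2)
def phi(X):
    return np.column_stack([X[:, 0]**2, np.sqrt(2) * X[:, 0] * X[:, 1], X[:, 1]**2])

Phi = phi(X)  # n x D with D = 3

# Primal ridge: w = (Phi^T Phi + lam I_D)^{-1} Phi^T y, predictions Phi w
w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(Phi.shape[1]), Phi.T @ y)
pred_primal = Phi @ w

# Dual ridge: alpha = (Phi Phi^T + lam I_n)^{-1} y,
# predictions f(x) = sum_i alpha_i <phi(x_i), phi(x)>
K = Phi @ Phi.T  # n x n inner products in feature space
# (for this map, K could equally be computed as (X @ X.T)**2, without Phi)
alpha = np.linalg.solve(K + lam * np.eye(n), y)
pred_dual = K @ alpha

print(np.allclose(pred_primal, pred_dual))  # True: only inner products were needed
```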
| Aspect | Explicit Features | Kernel Approach |
|---|---|---|
| Feature computation | O(n · D) | Not needed |
| Storage | O(n · D) | O(n²) |
| Works for D → ∞ | No | Yes |
| Computational bottleneck | Feature dimension D | Sample size n |
| Suited for | Low-dimensional features | High/infinite dimensions |
We have built up to a remarkable insight. Let us state it clearly before developing it fully in the next page.
The Key Observation
For certain feature maps $\phi$, there exists a kernel function $k$ such that:
$$k(\mathbf{x}, \mathbf{x}') = \langle \phi(\mathbf{x}), \phi(\mathbf{x}') \rangle$$
The kernel computes the inner product in feature space directly from the original inputs, without ever constructing the feature vectors.
Example: Polynomial Kernel
For the degree-2 polynomial feature map we saw earlier, the kernel is simply:
$$k(\mathbf{x}, \mathbf{x}') = (\mathbf{x}^\top \mathbf{x}')^2$$
Computing the kernel directly takes one inner product and one squaring: $O(d)$ operations.
Computing the same quantity through explicit features requires constructing roughly $d^2$ monomials for each point and then taking their inner product: $O(d^2)$ operations.
The kernel route is therefore a factor of $d$ faster, and this gap widens rapidly with the polynomial degree, as the timing sketch below illustrates.
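Here is a rough timing sketch (the dimension $d$ and the use of all ordered products $x_i x_j$ as the explicit degree-2 features are illustrative choices):

```python
import numpy as np
import time

d = 2000
rng = np.random.default_rng(0)
x, x_prime = rng.standard_normal(d), rng.standard_normal(d)

# Kernel route: one dot product, then a square -- O(d)
t0 = time.perf_counter()
k_fast = np.dot(x, x_prime)**2
t_fast = time.perf_counter() - t0

# Explicit route: build all d^2 products x_i * x_j, then take an inner product -- O(d^2)
t0 = time.perf_counter()
phi_x = np.outer(x, x).ravel()
phi_xp = np.outer(x_prime, x_prime).ravel()
k_slow = np.dot(phi_x, phi_xp)
t_slow = time.perf_counter() - t0

print(np.isclose(k_fast, k_slow))                    # same value either way
print(f"kernel: {t_fast*1e6:.0f} µs, explicit: {t_slow*1e6:.0f} µs")
```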
The kernel trick lets us work implicitly in enormous (even infinite-dimensional) feature spaces, pay a computational cost governed by the number of samples rather than the feature dimension, and apply familiar linear algorithms to highly nonlinear problems.
This is one of the most elegant ideas in machine learning.
Algorithms That Work With Kernels
Not all algorithms can exploit the kernel trick—only those whose computations can be expressed entirely in terms of inner products between data points. Fortunately, many important algorithms qualify, including ridge regression, support vector machines, and principal component analysis.
These algorithms can be "kernelized" to work in arbitrary feature spaces defined by the kernel.
What Comes Next
In the following pages, we will formally define kernel functions, develop the kernel trick in full, and see how it turns the feature-space ideas above into practical algorithms such as kernel ridge regression.
We have established the foundational concepts that motivate kernel methods:
The Problem: Linear models cannot capture nonlinear relationships in data. When the true function is curved, kinked, or involves interactions, linear predictions fail.
The Solution: Transform the data using a feature map $\phi$ into a new space where the relationship is linear. A model that is linear in $\phi(\mathbf{x})$ can be arbitrarily nonlinear in $\mathbf{x}$.
The Challenge: Explicit feature computation is infeasible when the feature dimension is very large or infinite.
The Insight: We only need inner products in feature space. If we can compute $\langle \phi(\mathbf{x}), \phi(\mathbf{x}') \rangle$ directly via a kernel $k(\mathbf{x}, \mathbf{x}')$, we bypass the explicit feature computation entirely.
You now understand why we need feature space mapping and the computational challenges it presents. In the next page, we will formally introduce kernel functions—the mathematical objects that make all of this practical.