Throughout our journey in regression, we have relied heavily on linear models—methods that express predictions as weighted combinations of input features. Linear regression, ridge regression, and lasso all share this fundamental assumption: the relationship between inputs and outputs can be captured by a linear function.
But here lies a profound limitation: the real world is rarely linear.
Consider these everyday phenomena:
When we attempt to model these nonlinear relationships with linear methods, we are fundamentally limited—not by our algorithmic prowess, but by our representation of the problem.
By the end of this page, you will understand why mapping data to higher-dimensional feature spaces enables linear methods to capture nonlinear patterns. You'll see how this innocent-seeming transformation—feature space mapping—forms the conceptual foundation for kernel methods, one of the most powerful ideas in machine learning.
Let us formalize what we mean by 'linear' and understand precisely where this constraint binds us.
The Linear Model Structure
A linear model in its most general form predicts:
$$\hat{y} = f(\mathbf{x}) = \mathbf{w}^\top \mathbf{x} + b = \sum_{j=1}^{d} w_j x_j + b$$
where $\mathbf{x} \in \mathbb{R}^d$ is the input feature vector, $\mathbf{w} \in \mathbb{R}^d$ is the weight vector, and $b$ is the bias (intercept) term.
This model is called 'linear' because $f(\mathbf{x})$ is a linear function of $\mathbf{x}$—technically, an affine function since we include the bias. The crucial property is that the prediction changes proportionally and additively with each input feature.
An important distinction: 'linear model' typically means linear in the parameters (weights). A model like $y = w_1 x^2 + w_2 x + b$ is linear in its parameters ($w_1, w_2, b$) even though it's quadratic in $x$. This distinction becomes crucial when we discuss feature mappings.
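To make the distinction concrete, here is a minimal sketch (the data and coefficients are purely illustrative) showing that fitting $y = w_1 x^2 + w_2 x + b$ is still an ordinary least-squares problem: we simply treat $x^2$ as one more column of the design matrix.

```python
import numpy as np

# Illustrative data from a quadratic relationship (assumed coefficients)
rng = np.random.default_rng(0)
x = np.linspace(-2, 2, 50)
y = 1.5 * x**2 - 0.7 * x + 2.0 + 0.1 * rng.standard_normal(50)

# The model y = w1*x^2 + w2*x + b is quadratic in x but linear in (w1, w2, b),
# so ordinary least squares applies once x^2 is treated as a column of its own.
A = np.column_stack([x**2, x, np.ones_like(x)])
w1, w2, b = np.linalg.lstsq(A, y, rcond=None)[0]
print(f"Recovered: w1={w1:.2f}, w2={w2:.2f}, b={b:.2f}")  # close to 1.5, -0.7, 2.0
```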
Geometric Interpretation
Geometrically, a linear model in $d$ dimensions defines a hyperplane: in regression, the prediction surface is a flat hyperplane over the input space; in classification, the decision boundary is a hyperplane separating the classes.
The fundamental limitation is clear: a hyperplane can only capture relationships that are themselves planar. When the true relationship curves, bends, or twists through space, no hyperplane—no matter how cleverly positioned—can adequately describe it.
```python
import numpy as np
from numpy.linalg import lstsq

# Generate nonlinear data: y = x^2 + noise
np.random.seed(42)
X = np.linspace(-3, 3, 100).reshape(-1, 1)
y_true = X.ravel()**2
y = y_true + np.random.normal(0, 0.5, 100)

# Fit a linear model (the best possible straight line)
X_design = np.column_stack([np.ones(100), X])
w_linear = lstsq(X_design, y, rcond=None)[0]
y_pred_linear = X_design @ w_linear

# Calculate residual error
mse_linear = np.mean((y - y_pred_linear)**2)
print(f"Linear model MSE: {mse_linear:.4f}")

# The linear model cannot capture the parabolic structure: the best-fit
# line is nearly flat (y ≈ 3) and misses the curvature entirely,
# leaving a large MSE (above 7 on this data).
```

The XOR Problem: A Classic Example
The limitations of linearity are starkly illustrated by the XOR problem—a pattern that is trivially simple for nonlinear methods but fundamentally impossible for any linear classifier: the points $(0,0)$ and $(1,1)$ belong to class 0, while $(0,1)$ and $(1,0)$ belong to class 1.
No single line can separate these four points by class. This is because the XOR function is not linearly separable—there exists no hyperplane that places all class-0 points on one side and all class-1 points on the other.
This example, though simple, represents a fundamental barrier that appears throughout machine learning whenever the underlying relationship involves interactions, thresholds, or complex dependencies between features.
Here is the key insight that unlocks nonlinear modeling with linear methods:
If the data is not linearly separable in its current representation, perhaps it becomes linearly separable in a different representation.
This deceptively simple idea—transforming the data before applying a linear method—is the foundation of feature space mapping and, ultimately, kernel methods.
The Core Strategy
Instead of fitting a linear model to the original features $\mathbf{x}$, we:

1. Choose a feature map $\phi$ and transform each input into $\phi(\mathbf{x})$.
2. Fit an ordinary linear model in the transformed space.
3. Obtain a predictor $f(\mathbf{x}) = \mathbf{w}^\top \phi(\mathbf{x})$ that is linear in $\phi(\mathbf{x})$ but can be highly nonlinear in $\mathbf{x}$.
The function $\phi$ does not change the underlying problem—the same data points, the same labels. What changes is our representation of the problem. A curve in $\mathbb{R}^2$ might unfold into a straight line in $\mathbb{R}^{10}$. The complexity hasn't disappeared; it has been absorbed into the transformation.
Solving XOR with Feature Transformation
Let us return to the XOR problem. With original features $\mathbf{x} = (x_1, x_2)$, no linear separator exists. But consider this feature map:
$$\phi(x_1, x_2) = (x_1, x_2, x_1 \cdot x_2)$$
We have added a third dimension: the product of the two original features. Now let's examine the transformed points:
| Original $(x_1, x_2)$ | Class | Transformed $(x_1, x_2, x_1 x_2)$ |
|---|---|---|
| $(0, 0)$ | 0 | $(0, 0, 0)$ |
| $(1, 1)$ | 0 | $(1, 1, 1)$ |
| $(0, 1)$ | 1 | $(0, 1, 0)$ |
| $(1, 0)$ | 1 | $(1, 0, 0)$ |
In the 3D transformed space, the classes pull apart: the class-1 points $(0,1,0)$ and $(1,0,0)$ satisfy $x_1 + x_2 - 2x_1x_2 = 1$, while the class-0 points $(0,0,0)$ and $(1,1,1)$ satisfy $x_1 + x_2 - 2x_1x_2 = 0$. The hyperplane $x_1 + x_2 - 2x_1x_2 = \tfrac{1}{2}$ now separates the classes perfectly!
The XOR problem, unsolvable by any linear method in the original space, becomes trivially solvable after the right transformation.
```python
import numpy as np

# XOR data
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])  # XOR labels

# Original space: no line w1*x1 + w2*x2 + b = 0 separates the classes
print("Original space: NOT linearly separable")

# Feature transformation: add the interaction term
def phi(X):
    """Map (x1, x2) -> (x1, x2, x1*x2)"""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([x1, x2, x1 * x2])

X_transformed = phi(X)
print("Transformed features:")
print(X_transformed)

# In the transformed space, the hyperplane x1 + x2 - 2*x1*x2 = 0.5
# separates the classes: class-1 points give x1 + x2 - 2*x1*x2 = 1,
# class-0 points give 0.
w = np.array([1, 1, -2])
b = -0.5
predictions = (X_transformed @ w + b) > 0
print(f"Predictions: {predictions.astype(int)}")
print(f"Labels:      {y}")
print(f"Accuracy: {np.mean(predictions == y):.1%}")
```

The most natural and widely used feature transformation is polynomial expansion—constructing new features as polynomial combinations of the original features.
Univariate Polynomial Features
For a single input $x$, a polynomial feature map of degree $p$ is:
$$\phi_p(x) = (1, x, x^2, x^3, \ldots, x^p)$$
A linear model in this transformed space:
$$f(x) = w_0 + w_1 x + w_2 x^2 + \cdots + w_p x^p$$
is a polynomial of degree $p$ in $x$. By choosing $p$ large enough, we can approximate any continuous function on a bounded interval arbitrarily well (by the Weierstrass approximation theorem).
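To see this in action, here is a small sketch (reusing the same kind of parabolic data as the earlier linear-fit example) that fits a linear model on the degree-2 features $\phi_2(x) = (1, x, x^2)$ and recovers the curvature the straight line missed:

```python
import numpy as np

# Parabolic data, as in the earlier linear-fit example: y = x^2 + noise
np.random.seed(42)
x = np.linspace(-3, 3, 100)
y = x**2 + np.random.normal(0, 0.5, 100)

# Degree-2 polynomial feature map: phi_2(x) = (1, x, x^2)
Phi = np.column_stack([np.ones_like(x), x, x**2])

# Ordinary least squares in the transformed space
w = np.linalg.lstsq(Phi, y, rcond=None)[0]
y_pred = Phi @ w
print(f"Polynomial (degree 2) MSE: {np.mean((y - y_pred)**2):.4f}")
# The error drops to roughly the noise variance (~0.25), far below the
# error of the best straight line on the same data.
```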
Multivariate Polynomial Features
For $d$-dimensional inputs, polynomial features include all monomials up to degree $p$:
$$\phi_p(\mathbf{x}) = (1, x_1, x_2, \ldots, x_d, x_1^2, x_1 x_2, x_1 x_3, \ldots, x_d^p)$$
This includes the constant term, all linear terms $x_j$, all squares $x_j^2$ and pairwise interactions $x_i x_j$, and so on up to every monomial of total degree $p$.
The number of polynomial features grows explosively. For $d$ input features and degree $p$, the number of features is $\binom{d + p}{p}$. With $d = 100$ features and $p = 3$, this exceeds 176,000 features. With $p = 5$, it exceeds 96 million! This explosion is where the kernel trick will save us.
Dimension of Polynomial Feature Spaces
The dimensionality $D$ of the polynomial feature space is given by:
$$D = \binom{d + p}{p} = \frac{(d + p)!}{d! \cdot p!}$$
Some examples:
| Input dimension $d$ | Degree $p$ | Feature dimension $D$ |
|---|---|---|
| 2 | 2 | 6 |
| 2 | 3 | 10 |
| 10 | 2 | 66 |
| 10 | 3 | 286 |
| 100 | 2 | 5,151 |
| 100 | 3 | 176,851 |
| 100 | 5 | 96,560,646 |
This combinatorial growth makes explicit feature computation infeasible for high degrees or high-dimensional inputs.
```python
import numpy as np
from itertools import combinations_with_replacement
from math import comb

def polynomial_features(X, degree):
    """
    Compute polynomial features up to the given degree.

    For X with shape (n_samples, n_features), returns all
    polynomial combinations up to 'degree'.
    """
    n_samples, n_features = X.shape

    # Calculate output dimension
    n_output_features = comb(n_features + degree, degree)
    print(f"Input: {n_features} features -> Output: {n_output_features} features")

    features = []

    # Degree 0: constant term
    features.append(np.ones(n_samples))

    # Degrees 1 through p
    for d in range(1, degree + 1):
        # All combinations of d features (with replacement)
        for combo in combinations_with_replacement(range(n_features), d):
            # Multiply the selected features
            feature = np.ones(n_samples)
            for idx in combo:
                feature *= X[:, idx]
            features.append(feature)

    return np.column_stack(features)

# Example: 2D input, degree 2
X = np.array([[1, 2], [3, 4], [5, 6]])
X_poly = polynomial_features(X, degree=2)
print(f"Shape: {X_poly.shape}")
print("Features: [1, x1, x2, x1^2, x1*x2, x2^2]")
print(X_poly)

# For (1, 2): [1, 1, 2, 1, 2, 4]
# For (3, 4): [1, 3, 4, 9, 12, 16]
# For (5, 6): [1, 5, 6, 25, 30, 36]
```

Why Polynomial Features Work
Polynomial features work because they capture interactions between features (through cross terms such as $x_i x_j$), curvature (through squares and higher powers), and increasingly fine detail as the degree grows.
The Taylor series expansion tells us that any smooth function can be locally approximated by a polynomial. With enough polynomial features and appropriate regularization, we can model arbitrarily complex relationships.
While polynomial features are intuitive, they are just one family among infinitely many possible feature transformations. The concept of feature mapping is far more general.
The Abstract Perspective
A feature map (also called a basis function expansion) is any function:
$$\phi: \mathcal{X} \rightarrow \mathcal{H}$$
that transforms inputs from the original space $\mathcal{X}$ (often $\mathbb{R}^d$) into a feature space $\mathcal{H}$ (often $\mathbb{R}^D$ with $D > d$, or even infinite-dimensional).
The only requirement is that we can subsequently fit a linear model in $\mathcal{H}$:
$$f(\mathbf{x}) = \mathbf{w}^\top \phi(\mathbf{x}) = \sum_{j=1}^{D} w_j \phi_j(\mathbf{x})$$
Different choices of $\phi$ encode different inductive biases—assumptions about what kinds of functions are likely to fit the data well.
Radial Basis Function Features
RBF features are particularly important because they connect to the Gaussian kernel we will study later:
$$\phi_\text{RBF}(\mathbf{x}) = \left( \exp\left(-\gamma \|\mathbf{x} - \mathbf{c}_1\|^2\right), \ldots, \exp\left(-\gamma \|\mathbf{x} - \mathbf{c}_m\|^2\right) \right)$$
Each feature measures the similarity of $\mathbf{x}$ to a center point $\mathbf{c}_j$. Points near $\mathbf{c}_j$ activate that feature strongly; distant points activate it weakly.
The parameter $\gamma$ controls the bandwidth: a large $\gamma$ gives narrow bumps, so each feature responds only to points very close to its center; a small $\gamma$ gives wide bumps that respond to a broad region of the input space.
With enough centers (potentially one per training point), RBF features can approximate any function.
```python
import numpy as np

def rbf_features(X, centers, gamma):
    """
    Compute RBF features.

    Parameters:
    - X: (n_samples, n_features) input data
    - centers: (n_centers, n_features) RBF centers
    - gamma: bandwidth parameter

    Returns:
    - (n_samples, n_centers) RBF feature matrix
    """
    # Compute squared distances between each x and each center
    # using the identity ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a·b
    X_sq = np.sum(X**2, axis=1, keepdims=True)        # (n, 1)
    C_sq = np.sum(centers**2, axis=1, keepdims=True)  # (m, 1)
    XC = X @ centers.T                                # (n, m)
    sq_dist = X_sq + C_sq.T - 2 * XC                  # (n, m)
    return np.exp(-gamma * sq_dist)

# Example: nonlinear 1D regression
np.random.seed(42)
X = np.linspace(0, 4*np.pi, 50).reshape(-1, 1)
y = np.sin(X.ravel()) + 0.1 * np.random.randn(50)

# Create RBF features using the training points as centers
centers = X.copy()  # each point is a center
gamma = 1.0
Phi = rbf_features(X, centers, gamma)

print(f"RBF feature matrix shape: {Phi.shape}")
print(f"Phi[0, 0] (x near center 0): {Phi[0, 0]:.4f}")
print(f"Phi[0, 25] (x far from center 25): {Phi[0, 25]:.6f}")

# Fit linear regression in RBF space (with regularization)
lambda_reg = 0.001
w = np.linalg.solve(Phi.T @ Phi + lambda_reg * np.eye(50), Phi.T @ y)

# Predictions
y_pred = Phi @ w
mse = np.mean((y - y_pred)**2)
print(f"MSE: {mse:.6f}")  # near-perfect fit on the training data
```

To truly understand why feature mapping works, we must develop geometric intuition for what happens when we lift data into higher dimensions.
The Lifting Metaphor
Imagine you have points scattered on a sheet of paper (2D), and they cannot be separated by a straight line. Now imagine crumpling that paper into a ball (embedding it in 3D). Suddenly, a plane that cuts through the ball might separate the points that were inseparable on the flat sheet.
This is precisely what feature mapping does: it lifts data from a low-dimensional space where linear separation is impossible into a higher-dimensional space where it becomes possible.
Mathematical Intuition
Consider points on a circle: $\mathbf{x} = (\cos\theta, \sin\theta)$. In 2D, these points lie on a curve. But with the feature map:
$$\phi(x_1, x_2) = (x_1^2, \sqrt{2} x_1 x_2, x_2^2)$$
the transformed points satisfy:
$$\|\phi(\mathbf{x})\|^2 = x_1^4 + 2x_1^2 x_2^2 + x_2^4 = (x_1^2 + x_2^2)^2 = 1$$
All points on the unit circle in 2D map to unit-norm points in feature space: they land on the unit sphere in $\mathbb{R}^3$, as the quick check below confirms. The curved structure in the original space becomes a different (possibly simpler) structure in feature space.
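Here is a small numerical sketch (using the degree-2 feature map defined above) of that claim:

```python
import numpy as np

def phi(x1, x2):
    """Degree-2 feature map: (x1, x2) -> (x1^2, sqrt(2)*x1*x2, x2^2)."""
    return np.stack([x1**2, np.sqrt(2) * x1 * x2, x2**2], axis=-1)

# Points on the unit circle
theta = np.linspace(0, 2 * np.pi, 8, endpoint=False)
x1, x2 = np.cos(theta), np.sin(theta)

# Every transformed point has norm 1: the circle lands on the unit sphere in R^3
norms = np.linalg.norm(phi(x1, x2), axis=-1)
print(np.allclose(norms, 1.0))  # True
```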
Cover's theorem (1965) formalizes this intuition: a complex pattern-classification problem cast in a high-dimensional space nonlinearly is more likely to be linearly separable than in a low-dimensional space. The probability of linear separability increases with dimensionality—but we must balance this against the danger of overfitting.
The Inner Product Perspective
A crucial insight emerges when we consider inner products in feature space. The inner product between two transformed points is:
$$\langle \phi(\mathbf{x}), \phi(\mathbf{x}') \rangle = \sum_{j=1}^{D} \phi_j(\mathbf{x}) \phi_j(\mathbf{x}')$$
This inner product measures similarity in feature space. Points that are similar (in the transformed representation) have large inner products; dissimilar points have small ones.
Here is the key observation that leads to kernels:
For certain feature maps, the inner product $\langle \phi(\mathbf{x}), \phi(\mathbf{x}') \rangle$ can be computed directly from $\mathbf{x}$ and $\mathbf{x}'$, without ever computing $\phi(\mathbf{x})$ or $\phi(\mathbf{x}')$ explicitly.
This observation is the foundation of the kernel trick, which we will develop in the next page.
```python
import numpy as np

# Example: degree-2 polynomial feature map in 2D
def phi_poly2(x):
    """
    Feature map: (x1, x2) -> (x1^2, sqrt(2)*x1*x2, x2^2)
    """
    x1, x2 = x[0], x[1]
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

# Two points
x = np.array([1.0, 2.0])
x_prime = np.array([3.0, 4.0])

# Method 1: compute the features, then the inner product
phi_x = phi_poly2(x)
phi_x_prime = phi_poly2(x_prime)
inner_prod_explicit = np.dot(phi_x, phi_x_prime)
print("Explicit computation:")
print(f"  φ(x)  = {phi_x}")
print(f"  φ(x') = {phi_x_prime}")
print(f"  <φ(x), φ(x')> = {inner_prod_explicit}")

# Method 2: direct computation (a preview of the kernel trick)
# For this specific φ: <φ(x), φ(x')> = (x · x')^2
inner_prod_direct = np.dot(x, x_prime)**2
print("Direct computation:")
print(f"  x · x' = {np.dot(x, x_prime)}")
print(f"  (x · x')^2 = {inner_prod_direct}")

# They are equal!
print(f"Equal? {np.allclose(inner_prod_explicit, inner_prod_direct)}")

# This is remarkable: we get the same result without
# ever computing the 3D feature vectors!
```

We have established that feature mapping enables linear methods to learn nonlinear patterns. However, this power comes with a severe computational burden.
The Explicit Feature Problem
Consider ridge regression with polynomial features of degree $p$. The algorithm requires:

1. Computing and storing the $n \times D$ feature matrix $\Phi$.
2. Forming the $D \times D$ matrix $\Phi^\top \Phi$, which costs $O(nD^2)$ time and $O(D^2)$ memory.
3. Solving a $D \times D$ linear system, which costs $O(D^3)$ time.
When $D = \binom{d+p}{p}$ grows exponentially in $p$, even storing the feature matrix becomes infeasible, let alone computing with it.
A Concrete Example
With $d = 100$ input features and degree $p = 5$, the feature dimension is $D \approx 96.5$ million. Storing the feature matrix for even $n = 1{,}000$ samples would take roughly 770 GB (at 8 bytes per entry), and the $D \times D$ matrix $\Phi^\top \Phi$ would occupy tens of petabytes.
This is utterly impractical. Yet the underlying model—a degree-5 polynomial—might be exactly what we need.
We have a powerful idea (feature expansion) that is computationally infeasible for the cases where we need it most (high-degree polynomials, high-dimensional inputs). We need a way to work in high-dimensional feature spaces without ever constructing the features explicitly.
Infinite-Dimensional Feature Spaces
Some of the most powerful feature maps are not just high-dimensional but infinite-dimensional. The Gaussian RBF feature map, for example, maps each input to an infinite-dimensional vector:
$$\phi_\text{Gaussian}(\mathbf{x}) \in \mathbb{R}^\infty$$
We cannot possibly compute or store infinite-dimensional vectors. Yet, as we will see, we can work with them through kernels.
The Dual Representation Hint
Recall from our study of ridge regression that the solution can be written in dual form:
$$\mathbf{w} = \Phi^\top \boldsymbol{\alpha} = \sum_{i=1}^{n} \alpha_i \phi(\mathbf{x}_i)$$
for some coefficients $\boldsymbol{\alpha} \in \mathbb{R}^n$. The weight vector is a linear combination of the training points in feature space.
This means that predictions become:
$$f(\mathbf{x}) = \mathbf{w}^\top \phi(\mathbf{x}) = \sum_{i=1}^{n} \alpha_i \langle \phi(\mathbf{x}_i), \phi(\mathbf{x}) \rangle$$
The prediction depends only on inner products between the test point and training points in feature space—not on the features themselves!
This observation is the gateway to the kernel trick: if we can compute these inner products efficiently, we never need the features explicitly.
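As a sketch (with synthetic data and the degree-2 feature map from earlier), we can verify numerically that the primal ridge solution, which manipulates the $D$-dimensional weight vector, and the dual solution, which only ever touches the $n \times n$ matrix of feature-space inner products, produce identical predictions:

```python
import numpy as np

rng = np.random.default_rng(0)
n, lam = 20, 0.1
X = rng.standard_normal((n, 2))
y = rng.standard_normal(n)

# Explicit degree-2 feature map: (x1^2, sqrt(2)*x1*x2, x2^2)
def phi(X):
    return np.column_stack([X[:, 0]**2, np.sqrt(2) * X[:, 0] * X[:, 1], X[:, 1]**2])

Phi = phi(X)  # n x D with D = 3

# Primal ridge: w = (Phi^T Phi + lam I_D)^{-1} Phi^T y, predictions Phi w
w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(Phi.shape[1]), Phi.T @ y)
pred_primal = Phi @ w

# Dual ridge: alpha = (Phi Phi^T + lam I_n)^{-1} y,
# predictions f(x) = sum_i alpha_i <phi(x_i), phi(x)>
K = Phi @ Phi.T  # n x n inner products in feature space
# (for this map, K could equally be computed as (X @ X.T)**2, without Phi)
alpha = np.linalg.solve(K + lam * np.eye(n), y)
pred_dual = K @ alpha

print(np.allclose(pred_primal, pred_dual))  # True: only inner products were needed
```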
| Aspect | Explicit Features | Kernel Approach |
|---|---|---|
| Feature computation | O(n · D) | Not needed |
| Storage | O(n · D) | O(n²) |
| Works for D → ∞ | No | Yes |
| Computational bottleneck | Feature dimension D | Sample size n |
| Suited for | Low-dimensional features | High/infinite dimensions |
We have built up to a remarkable insight. Let us state it clearly before developing it fully in the next page.
The Key Observation
For certain feature maps $\phi$, there exists a kernel function $k$ such that:
$$k(\mathbf{x}, \mathbf{x}') = \langle \phi(\mathbf{x}), \phi(\mathbf{x}') \rangle$$
The kernel computes the inner product in feature space directly from the original inputs, without ever constructing the feature vectors.
Example: Polynomial Kernel
For the degree-2 polynomial feature map we saw earlier, the kernel is simply:
$$k(\mathbf{x}, \mathbf{x}') = (\mathbf{x}^\top \mathbf{x}')^2$$
Computing the kernel directly takes one inner product and one squaring: $O(d)$ operations.
Computing the same quantity through explicit features requires constructing roughly $d^2$ monomials for each point and then taking their inner product: $O(d^2)$ operations.
The kernel route is therefore a factor of $d$ faster, and this gap widens rapidly with the polynomial degree, as the timing sketch below illustrates.
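Here is a rough timing sketch (the dimension $d$ and the use of all ordered products $x_i x_j$ as the explicit degree-2 features are illustrative choices):

```python
import numpy as np
import time

d = 2000
rng = np.random.default_rng(0)
x, x_prime = rng.standard_normal(d), rng.standard_normal(d)

# Kernel route: one dot product, then a square -- O(d)
t0 = time.perf_counter()
k_fast = np.dot(x, x_prime)**2
t_fast = time.perf_counter() - t0

# Explicit route: build all d^2 products x_i * x_j, then take an inner product -- O(d^2)
t0 = time.perf_counter()
phi_x = np.outer(x, x).ravel()
phi_xp = np.outer(x_prime, x_prime).ravel()
k_slow = np.dot(phi_x, phi_xp)
t_slow = time.perf_counter() - t0

print(np.isclose(k_fast, k_slow))                    # same value either way
print(f"kernel: {t_fast*1e6:.0f} µs, explicit: {t_slow*1e6:.0f} µs")
```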
The kernel trick lets us work implicitly in enormous (even infinite-dimensional) feature spaces, pay a computational cost governed by the number of samples rather than the feature dimension, and apply familiar linear algorithms to highly nonlinear problems.
This is one of the most elegant ideas in machine learning.
Algorithms That Work With Kernels
Not all algorithms can exploit the kernel trick—only those whose computations can be expressed entirely in terms of inner products between data points. Fortunately, many important algorithms qualify, including ridge regression, support vector machines, and principal component analysis.
These algorithms can be "kernelized" to work in arbitrary feature spaces defined by the kernel.
What Comes Next
In the following pages, we will formally define kernel functions, develop the kernel trick in full, and see how it turns the feature-space ideas above into practical algorithms such as kernel ridge regression.
We have established the foundational concepts that motivate kernel methods:
The Problem: Linear models cannot capture nonlinear relationships in data. When the true function is curved, kinked, or involves interactions, linear predictions fail.
The Solution: Transform the data using a feature map $\phi$ into a new space where the relationship is linear. A model that is linear in $\phi(\mathbf{x})$ can be arbitrarily nonlinear in $\mathbf{x}$.
The Challenge: Explicit feature computation is infeasible when the feature dimension is very large or infinite.
The Insight: We only need inner products in feature space. If we can compute $\langle \phi(\mathbf{x}), \phi(\mathbf{x}') \rangle$ directly via a kernel $k(\mathbf{x}, \mathbf{x}')$, we bypass the explicit feature computation entirely.
You now understand why we need feature space mapping and the computational challenges it presents. In the next page, we will formally introduce kernel functions—the mathematical objects that make all of this practical.