Linear regression is one of the most powerful tools in machine learning, but it comes with a fundamental limitation: it assumes the relationship between features and target is a straight line. In reality, most phenomena in the physical world, economics, biology, and engineering exhibit curved, nonlinear patterns.
Consider, for example, a car's stopping distance: physics says it grows with the square of speed, a relationship no straight line can capture.
Polynomial regression elegantly bridges this gap by extending linear regression to model nonlinear relationships—while keeping all the computational conveniences of linear algebra intact.
By the end of this page, you will understand: (1) The mathematical formulation of polynomial features; (2) How polynomial regression remains 'linear' in a specific mathematical sense; (3) The complete feature transformation process; (4) Computational considerations and the design matrix structure; (5) When polynomial features are appropriate vs. other nonlinear methods.
The key insight that makes polynomial regression work is the distinction between linearity in features and linearity in parameters. When we say linear regression is 'linear,' we mean the model output is a linear combination of the parameters—not necessarily a linear function of the input.
Standard Linear Regression: $$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_p x_p + \epsilon$$
This model is linear in both the parameters $(\beta_0, \beta_1, \ldots, \beta_p)$ and the features $(x_1, x_2, \ldots, x_p)$.
Polynomial Regression (degree 2): $$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \epsilon$$
This model is nonlinear in the input $x$, but critically, it remains linear in the parameters $(\beta_0, \beta_1, \beta_2)$. This linearity in parameters is what matters for applying all the machinery of linear regression.
If we define new features $z_1 = x$ and $z_2 = x^2$, the polynomial model becomes: $$y = \beta_0 + \beta_1 z_1 + \beta_2 z_2 + \epsilon$$ This is just ordinary linear regression in the transformed feature space! All the tools we developed for linear regression—OLS, gradient descent, closed-form solutions—work without modification.
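To see this concretely, here is a minimal sketch (using NumPy on made-up, noiseless data) showing that stacking $z_1 = x$ and $z_2 = x^2$ as columns and running ordinary least squares recovers the quadratic's coefficients:

```python
import numpy as np

# Synthetic data from a known quadratic: y = 1 + 2x + 3x^2 (no noise)
x = np.linspace(-2, 2, 25)
y = 1 + 2 * x + 3 * x**2

# Transformed features: intercept column, z1 = x, z2 = x^2
Z = np.column_stack([np.ones_like(x), x, x**2])

# Plain OLS on the transformed features -- nothing polynomial-specific here
beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
print(beta)  # approximately [1. 2. 3.]
```

The solver never knows a polynomial is involved; it only sees three columns of numbers.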
Why This Matters Mathematically:
The normal equations for OLS require the model to be linear in parameters to yield a closed-form solution: $$\hat{\boldsymbol{\beta}} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}$$
For this to work, the design matrix $\mathbf{X}$ can contain any transformations of the original inputs—squares, cubes, cross products—as long as the model remains: $$\mathbf{y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\epsilon}$$
where $\boldsymbol{\beta}$ appears linearly. This is the foundation of all feature engineering approaches that extend linear models.
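As a quick sanity check, here is a sketch (assuming NumPy and synthetic data) confirming that the closed-form normal-equations solution agrees with a generic least-squares solver whenever $\mathbf{X}^T \mathbf{X}$ is invertible:

```python
import numpy as np

rng = np.random.default_rng(0)

# Design matrix with an intercept column and two random features
X = np.column_stack([np.ones(50), rng.normal(size=(50, 2))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(0, 0.1, 50)

# Closed-form OLS via the normal equations
beta_ne = np.linalg.solve(X.T @ X, X.T @ y)

# Generic least-squares solver
beta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(beta_ne, beta_ls))  # True
```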
Single Variable Case:
Given a single input variable $x$ and a polynomial of degree $d$, the polynomial feature expansion creates $d+1$ features (including the intercept):
$$\phi(x) = [1, x, x^2, x^3, \ldots, x^d]^T$$
The polynomial regression model becomes: $$y = \beta_0 \cdot 1 + \beta_1 \cdot x + \beta_2 \cdot x^2 + \ldots + \beta_d \cdot x^d + \epsilon$$
Or in matrix form for $n$ observations: $$\mathbf{y} = \mathbf{\Phi} \boldsymbol{\beta} + \boldsymbol{\epsilon}$$
where $\mathbf{\Phi}$ is the Vandermonde matrix:
$$\mathbf{\Phi} = \begin{bmatrix} 1 & x_1 & x_1^2 & \cdots & x_1^d \\ 1 & x_2 & x_2^2 & \cdots & x_2^d \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_n & x_n^2 & \cdots & x_n^d \end{bmatrix}$$
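NumPy can build this matrix directly; a one-line sketch using `np.vander` with `increasing=True` to match the column order shown above:

```python
import numpy as np

x = np.array([2.0, 3.0, 5.0])

# np.vander with increasing=True gives columns [1, x, x^2, x^3]
Phi = np.vander(x, N=4, increasing=True)
print(Phi)
# rows: [1, 2, 4, 8], [1, 3, 9, 27], [1, 5, 25, 125]
```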
```python
import numpy as np
from typing import Tuple

def polynomial_features_1d(x: np.ndarray, degree: int) -> np.ndarray:
    """
    Generate polynomial features for a single variable.

    Parameters:
        x: Input array of shape (n_samples,) or (n_samples, 1)
        degree: Maximum polynomial degree

    Returns:
        Feature matrix of shape (n_samples, degree + 1)
        Columns are [1, x, x^2, ..., x^degree]
    """
    x = np.asarray(x).flatten()

    # Build Vandermonde matrix
    # Each column is x raised to the corresponding power
    Phi = np.column_stack([x**k for k in range(degree + 1)])

    return Phi

def polynomial_regression_fit(x: np.ndarray, y: np.ndarray,
                              degree: int) -> Tuple[np.ndarray, np.ndarray]:
    """
    Fit polynomial regression by ordinary least squares.

    Parameters:
        x: Input features (n_samples,)
        y: Target values (n_samples,)
        degree: Polynomial degree

    Returns:
        beta: Fitted coefficients (degree + 1,)
        Phi: Design matrix (n_samples, degree + 1)
    """
    # Generate polynomial features
    Phi = polynomial_features_1d(x, degree)

    # Normal equations: beta = (Phi^T @ Phi)^{-1} @ Phi^T @ y
    # Using numpy's lstsq for numerical stability
    beta, residuals, rank, s = np.linalg.lstsq(Phi, y, rcond=None)

    return beta, Phi

# Example: Fitting a quadratic to noisy data
np.random.seed(42)
x = np.linspace(-3, 3, 50)
y_true = 2 - 0.5*x + 0.8*x**2                        # True quadratic
y = y_true + np.random.normal(0, 0.5, size=len(x))   # Add noise

# Fit polynomial of degree 2
beta, Phi = polynomial_regression_fit(x, y, degree=2)
print(f"Fitted coefficients: β₀={beta[0]:.3f}, β₁={beta[1]:.3f}, β₂={beta[2]:.3f}")
print(f"True coefficients:   β₀=2.000, β₁=-0.500, β₂=0.800")
```

The Vandermonde matrix is named after Alexandre-Théophile Vandermonde. It appears in polynomial interpolation, discrete Fourier transforms, and numerous other applications. Its determinant has a beautiful closed form: $\det(\mathbf{V}) = \prod_{i<j}(x_j - x_i)$, which is non-zero when all $x_i$ are distinct.
Extension to Multiple Variables:
When we have $p$ input features $\mathbf{x} = (x_1, x_2, \ldots, x_p)$, polynomial feature expansion becomes more complex because we must consider interaction terms (cross-products between different variables).
For degree $d$ and $p$ variables, we include all monomials of the form: $$x_1^{k_1} \cdot x_2^{k_2} \cdot \ldots \cdot x_p^{k_p}$$
where $k_1 + k_2 + \ldots + k_p \leq d$ and each $k_i \geq 0$.
Example: Two Variables, Degree 2:
For $\mathbf{x} = (x_1, x_2)$ with $d=2$, the polynomial features are: $$\phi(\mathbf{x}) = [1, x_1, x_2, x_1^2, x_1 x_2, x_2^2]^T$$
| Variables (p) | Degree 2 | Degree 3 | Degree 4 | Degree 5 |
|---|---|---|---|---|
| 1 | 3 | 4 | 5 | 6 |
| 2 | 6 | 10 | 15 | 21 |
| 3 | 10 | 20 | 35 | 56 |
| 5 | 21 | 56 | 126 | 252 |
| 10 | 66 | 286 | 1,001 | 3,003 |
| 20 | 231 | 1,771 | 10,626 | 53,130 |
The Combinatorial Formula:
The number of polynomial features for $p$ variables up to degree $d$ (including the intercept) is: $$N_{\text{features}} = \binom{p + d}{d} = \frac{(p + d)!}{p! \cdot d!}$$
This grows rapidly with both the number of variables and the polynomial degree—a phenomenon known as the curse of dimensionality in the polynomial feature space.
With 20 features and degree 5, you generate over 53,000 polynomial features! This combinatorial growth means polynomial features work best for low-dimensional problems or when combined with regularization. For high-dimensional data, consider kernel methods or neural networks instead.
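The feature counts in the table follow directly from the formula; a short sketch using Python's standard-library `math.comb` reproduces them:

```python
from math import comb

def n_poly_features(p: int, d: int) -> int:
    """Number of monomials of total degree <= d in p variables (incl. intercept)."""
    return comb(p + d, d)

for p in (1, 2, 3, 5, 10, 20):
    print(p, [n_poly_features(p, d) for d in (2, 3, 4, 5)])
# The p = 20 row gives [231, 1771, 10626, 53130], matching the table above.
```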
```python
import numpy as np
from itertools import combinations_with_replacement

def polynomial_features_multi(X: np.ndarray, degree: int,
                              include_bias: bool = True) -> np.ndarray:
    """
    Generate polynomial features for multiple variables.

    Parameters:
        X: Input array of shape (n_samples, n_features)
        degree: Maximum polynomial degree
        include_bias: Whether to include the bias (all-ones) column

    Returns:
        Feature matrix with all polynomial terms up to degree d
    """
    n_samples, n_features = X.shape

    # Generate all combinations of feature indices with repetition.
    # For degree d, we need all multisets of indices of size at most d.
    feature_combinations = []
    for d in range(degree + 1):
        # All ways to choose d items from n_features (with replacement)
        for combo in combinations_with_replacement(range(n_features), d):
            feature_combinations.append(combo)

    # Build the feature matrix
    n_output_features = len(feature_combinations)
    Phi = np.ones((n_samples, n_output_features))

    for i, combo in enumerate(feature_combinations):
        for feature_idx in combo:
            Phi[:, i] *= X[:, feature_idx]

    if not include_bias:
        # Remove the first column (all ones)
        Phi = Phi[:, 1:]

    return Phi

# Example: 2 variables, degree 2
X = np.array([[1, 2], [3, 4], [5, 6]])

Phi = polynomial_features_multi(X, degree=2)
print("Original features X:")
print(X)
print("Polynomial features (degree 2):")
print("Columns: [1, x1, x2, x1², x1*x2, x2²]")
print(Phi)
```

Fitting Curves as Fitting Hyperplanes:
Polynomial regression transforms a problem of fitting a curve in the original space into fitting a hyperplane in a higher-dimensional feature space.
Example: Quadratic in 1D → Linear in 2D
Consider fitting $y = \beta_0 + \beta_1 x + \beta_2 x^2$ to data. In the original $(x, y)$ space, we're fitting a parabola. But if we map each point $(x_i, y_i)$ to $(z_{1i}, z_{2i}, y_i) = (x_i, x_i^2, y_i)$, we're now fitting a plane in 3D: $$y = \beta_0 + \beta_1 z_1 + \beta_2 z_2$$
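A short sketch (using NumPy with synthetic data) confirms that the two views coincide: fitting the plane in the lifted $(z_1, z_2)$ space gives the same coefficients as fitting the parabola directly in the original space:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 40)
y = 1.0 - 0.5 * x + 0.8 * x**2 + rng.normal(0, 0.1, x.size)

# Lift each point to (z1, z2) = (x, x^2) and fit a plane y = b0 + b1*z1 + b2*z2
Z = np.column_stack([np.ones_like(x), x, x**2])
beta_plane, *_ = np.linalg.lstsq(Z, y, rcond=None)

# Fitting a parabola directly in the original space gives the same coefficients
beta_poly = np.polyfit(x, y, deg=2)[::-1]  # polyfit returns highest degree first
print(np.allclose(beta_plane, beta_poly))  # True
```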
This geometric insight explains why every tool from linear regression carries over unchanged: in the lifted feature space the model really is a hyperplane, and OLS fits it exactly as before.
The Column Space Perspective:
Recall that OLS finds $\hat{\mathbf{y}} = \mathbf{\Phi} \hat{\boldsymbol{\beta}}$, the orthogonal projection of $\mathbf{y}$ onto the column space of $\mathbf{\Phi}$.
For polynomial features, the columns of $\mathbf{\Phi}$ are the basis functions $1, x, x^2, \ldots, x^d$ evaluated at the $n$ data points.
We're finding the best linear combination of polynomial basis functions to approximate $\mathbf{y}$. The fitted curve lives in the span of $\{1, x, x^2, \ldots, x^d\}$.
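The projection view is easy to verify numerically: for any least-squares fit, the residual must be orthogonal to every column of $\mathbf{\Phi}$. A minimal sketch with synthetic data:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 30)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.1, x.size)

# Degree-3 polynomial basis evaluated at the data points
Phi = np.column_stack([x**k for k in range(4)])

beta, *_ = np.linalg.lstsq(Phi, y, rcond=None)
y_hat = Phi @ beta

# Projection check: the residual is orthogonal to every basis column
residual = y - y_hat
print(np.max(np.abs(Phi.T @ residual)))  # ~ 0 (up to floating-point error)
```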
Numerical Stability Issues:
Polynomial features can cause severe numerical problems. Consider $x = 100$ and degree $d = 6$: $$x^6 = 10^{12}$$
This creates massive scale differences between columns of $\mathbf{\Phi}$, leading to ill-conditioned matrices, loss of floating-point precision, and unstable coefficient estimates.
```python
import numpy as np

def demonstrate_condition_number(x: np.ndarray, degrees: list):
    """
    Show how condition number grows with polynomial degree.
    High condition number = numerical instability.
    """
    print("Polynomial Degree vs. Condition Number")
    print("=" * 45)
    print(f"{'Degree':>8} | {'Condition Number':>20} | {'Status':>12}")
    print("-" * 45)

    for d in degrees:
        # Build Vandermonde matrix
        Phi = np.column_stack([x**k for k in range(d + 1)])

        # Condition number: ratio of largest to smallest singular value
        cond = np.linalg.cond(Phi)

        if cond < 1e3:
            status = "Good"
        elif cond < 1e8:
            status = "Warning"
        elif cond < 1e15:
            status = "Poor"
        else:
            status = "Singular!"

        print(f"{d:>8} | {cond:>20.2e} | {status:>12}")

# Data spread across [0, 10] - modest range
x = np.linspace(0, 10, 100)
demonstrate_condition_number(x, degrees=[1, 2, 3, 5, 8, 10, 15, 20])
```
```python
import numpy as np
from typing import Tuple

def stable_polynomial_fit(x: np.ndarray, y: np.ndarray,
                          degree: int) -> Tuple[np.ndarray, float, float]:
    """
    Numerically stable polynomial fitting using centering/scaling + QR.

    Parameters:
        x: Input values
        y: Target values
        degree: Polynomial degree

    Returns:
        coefficients: Fitted coefficients in the scaled basis
        mean: Mean used for centering
        std: Std used for scaling
    """
    # Step 1: Center and scale x
    x_mean = np.mean(x)
    x_std = np.std(x)
    if x_std < 1e-10:
        x_std = 1.0  # Prevent division by zero
    z = (x - x_mean) / x_std

    # Step 2: Build Vandermonde matrix with scaled data
    Phi = np.column_stack([z**k for k in range(degree + 1)])

    # Step 3: Solve using QR decomposition (more stable than normal equations)
    Q, R = np.linalg.qr(Phi)
    beta_scaled = np.linalg.solve(R, Q.T @ y)

    # Step 4: Converting back to the original basis requires careful
    # bookkeeping, so we return scaled coefficients plus the transformation
    # parameters and apply the same transformation at prediction time.
    return beta_scaled, x_mean, x_std

def predict_stable(x_new: np.ndarray, beta: np.ndarray,
                   x_mean: float, x_std: float) -> np.ndarray:
    """
    Make predictions using the stable polynomial model.
    """
    z_new = (x_new - x_mean) / x_std
    degree = len(beta) - 1
    Phi_new = np.column_stack([z_new**k for k in range(degree + 1)])
    return Phi_new @ beta

# Demonstration
np.random.seed(42)
x = np.linspace(-100, 100, 50)  # Large range - would be unstable without scaling
y = 5 - 0.02*x + 0.001*x**2 + np.random.normal(0, 2, len(x))

beta, x_mean, x_std = stable_polynomial_fit(x, y, degree=2)
y_pred = predict_stable(x, beta, x_mean, x_std)

print(f"Scaled coefficients: {beta}")
print(f"R² score: {1 - np.sum((y - y_pred)**2) / np.sum((y - np.mean(y))**2):.6f}")
```

Polynomial features are a powerful tool, but not always the right choice. Understanding their strengths and limitations helps you choose appropriately.
Polynomials are notorious for wild behavior outside the training data range. A degree-6 polynomial fitted to data in $[0, 10]$ can produce absurdly large predictions just a short distance beyond that interval, because the highest-degree term dominates once $x$ leaves the training range. If extrapolation is needed, consider models that extrapolate more gracefully (linear, asymptotic models) or limit polynomial degree severely.
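A small demonstration of this danger (a sketch using NumPy, fitting $\sin(x)$ as a stand-in for smooth training data): the in-range fit looks fine, but the error at $x = 15$ explodes:

```python
import numpy as np

# Fit a degree-6 polynomial to a smooth function on [0, 10]
x_train = np.linspace(0, 10, 200)
y_train = np.sin(x_train)
coeffs = np.polyfit(x_train, y_train, deg=6)

# In-range fit is decent...
err_in = np.max(np.abs(np.polyval(coeffs, x_train) - y_train))

# ...but a short step outside the training range blows up
x_out = 15.0
err_out = abs(np.polyval(coeffs, x_out) - np.sin(x_out))

print(f"max in-range error: {err_in:.3f}")
print(f"error at x = 15:    {err_out:.1f}")
```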
Alternatives to Consider:
| Alternative | When to Prefer Over Polynomials |
|---|---|
| Kernel methods | High-dimensional, complex patterns |
| Splines | Need local control, avoid global wiggles |
| Neural networks | Very complex patterns, lots of data |
| Decision trees | Non-smooth, discrete patterns |
| Fourier features | Periodic/oscillatory patterns |
| Logarithmic/exponential | Known functional form suggests it |
Polynomial features remain excellent for their simplicity, interpretability, and computational efficiency in appropriate settings. The key is recognizing those settings.
Let's walk through a complete polynomial regression example using the classic stopping distance dataset. Physics tells us stopping distance should scale with the square of speed (kinetic energy), making this a natural candidate for quadratic regression.
```python
import numpy as np
from typing import Dict

# Classic stopping distance data (speed in mph, distance in feet)
# From R's 'cars' dataset, a standard benchmark since the 1920s
speed = np.array([4, 7, 8, 9, 10, 11, 12, 13, 13, 14, 14, 14, 14, 15, 15,
                  15, 16, 16, 17, 17, 17, 18, 18, 18, 18, 19, 19, 19, 20,
                  20, 20, 20, 20, 22, 23, 24, 24, 24, 24, 25])
distance = np.array([2, 4, 16, 10, 18, 17, 14, 26, 34, 26, 36, 60, 80, 20,
                     26, 54, 32, 40, 32, 40, 50, 42, 56, 76, 84, 36, 46,
                     68, 32, 48, 52, 56, 64, 66, 54, 70, 92, 93, 120, 85])

def fit_and_evaluate(x: np.ndarray, y: np.ndarray,
                     degrees: list) -> Dict[int, dict]:
    """
    Fit polynomial models of various degrees and compute metrics.
    """
    # Center and scale for numerical stability
    x_mean, x_std = np.mean(x), np.std(x)
    z = (x - x_mean) / x_std

    results = {}
    n = len(y)

    for d in degrees:
        # Build design matrix
        Phi = np.column_stack([z**k for k in range(d + 1)])

        # Fit using QR decomposition
        Q, R = np.linalg.qr(Phi)
        beta = np.linalg.solve(R, Q.T @ y)

        # Predictions and residuals
        y_pred = Phi @ beta
        residuals = y - y_pred

        # Metrics
        ss_res = np.sum(residuals**2)
        ss_tot = np.sum((y - np.mean(y))**2)
        r_squared = 1 - ss_res / ss_tot

        # Adjusted R² (penalizes for number of parameters)
        p = d + 1  # number of parameters
        adj_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - p - 1)

        # Root Mean Square Error
        rmse = np.sqrt(ss_res / n)

        # Condition number (numerical stability indicator)
        cond = np.linalg.cond(Phi)

        results[d] = {
            'beta': beta,
            'y_pred': y_pred,
            'r_squared': r_squared,
            'adj_r_squared': adj_r_squared,
            'rmse': rmse,
            'condition_number': cond,
            'x_mean': x_mean,
            'x_std': x_std
        }

    return results

# Fit models of degree 1, 2, 3, 5
results = fit_and_evaluate(speed, distance, degrees=[1, 2, 3, 5])

# Print comparison
print("Polynomial Model Comparison: Stopping Distance")
print("=" * 65)
print(f"{'Degree':>8} | {'R²':>10} | {'Adj R²':>10} | {'RMSE':>10} | {'Cond #':>12}")
print("-" * 65)
for d, r in results.items():
    print(f"{d:>8} | {r['r_squared']:>10.4f} | {r['adj_r_squared']:>10.4f} | "
          f"{r['rmse']:>10.2f} | {r['condition_number']:>12.2e}")

print("Physics interpretation:")
print("- Linear (d=1): Ignores kinetic energy relationship")
print("- Quadratic (d=2): Matches physics (KE ~ v²)")
print("- Higher degrees: Overfit noise, worse generalization")
```

In this example, physics tells us stopping distance ≈ reaction distance (linear in speed) + braking distance (quadratic in speed). The quadratic model has the best adjusted R² because it captures the true relationship without overfitting. Domain knowledge guides optimal degree selection!
We've established the mathematical and computational foundation for polynomial regression. Let's consolidate the key concepts:
What's Next:
Now that we understand how to construct polynomial features, the crucial question becomes: How do we choose the right polynomial degree? Too low misses the pattern; too high overfits the noise. The next page develops rigorous methods for degree selection, including cross-validation, information criteria, and the bias-variance perspective on polynomial complexity.
You now understand the mathematical framework of polynomial features: how they extend linear regression to capture nonlinear patterns while preserving computational tractability. Next, we'll tackle the critical challenge of selecting optimal polynomial degree.