Linear regression is one of the most powerful tools in machine learning, but it comes with a fundamental limitation: it assumes the relationship between features and target is a straight line. In reality, most phenomena in the physical world, economics, biology, and engineering exhibit curved, nonlinear patterns.

Consider the following examples:
- Physics: Projectile motion follows a parabolic trajectory (quadratic in time)
- Economics: Diminishing returns create logarithmic or polynomial relationships
- Biology: Population growth often follows sigmoidal or polynomial patterns
- Engineering: Stress-strain relationships in materials exhibit nonlinear behavior

Polynomial regression elegantly bridges this gap by extending linear regression to model nonlinear relationships—while keeping all the computational conveniences of linear algebra intact.
By the end of this page, you will understand:
1. The mathematical formulation of polynomial features
2. How polynomial regression remains 'linear' in a specific mathematical sense
3. The complete feature transformation process
4. Computational considerations and the design matrix structure
5. When polynomial features are appropriate vs. other nonlinear methods
The key insight that makes polynomial regression work is the distinction between linearity in features and linearity in parameters. When we say linear regression is 'linear,' we mean the model output is a linear combination of the parameters—not necessarily a linear function of the input.

Standard Linear Regression:
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_p x_p + \epsilon$$

This model is linear in both the parameters $(\beta_0, \beta_1, \ldots, \beta_p)$ and the features $(x_1, x_2, \ldots, x_p)$.

Polynomial Regression (degree 2):
$$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \epsilon$$

This model is nonlinear in the input $x$, but critically, it remains linear in the parameters $(\beta_0, \beta_1, \beta_2)$. This linearity in parameters is what matters for applying all the machinery of linear regression.
If we define new features $z_1 = x$ and $z_2 = x^2$, the polynomial model becomes:
$$y = \beta_0 + \beta_1 z_1 + \beta_2 z_2 + \epsilon$$
This is just ordinary linear regression in the transformed feature space! All the tools we developed for linear regression—OLS, gradient descent, closed-form solutions—work without modification.
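To see this concretely, here is a minimal sketch (synthetic data with arbitrarily chosen coefficients and noise level): we build the transformed features $z_1 = x$ and $z_2 = x^2$ explicitly and hand them to an ordinary least-squares solver.

```python
import numpy as np

# Synthetic data from a known quadratic (coefficients chosen for illustration)
rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=200)
y = 1.0 + 2.0 * x - 3.0 * x**2 + rng.normal(0, 0.1, size=x.shape)

# Transformed feature space: intercept column, z1 = x, z2 = x^2
Z = np.column_stack([np.ones_like(x), x, x**2])

# Plain OLS on the transformed features recovers the quadratic's coefficients
beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
print(beta)  # approximately [1.0, 2.0, -3.0]
```

Nothing about the solver changes; only the design matrix does.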
Why This Matters Mathematically:

The normal equations for OLS require the model to be linear in parameters to yield a closed-form solution:
$$\hat{\boldsymbol{\beta}} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}$$

For this to work, the design matrix $\mathbf{X}$ can contain any transformations of the original inputs—squares, cubes, cross products—as long as the model remains:
$$\mathbf{y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\epsilon}$$

where $\boldsymbol{\beta}$ appears linearly. This is the foundation of all feature engineering approaches that extend linear models.
Single Variable Case:

Given a single input variable $x$ and a polynomial of degree $d$, the polynomial feature expansion creates $d+1$ features (including the intercept):

$$\phi(x) = [1, x, x^2, x^3, \ldots, x^d]^T$$

The polynomial regression model becomes:
$$y = \beta_0 \cdot 1 + \beta_1 \cdot x + \beta_2 \cdot x^2 + \ldots + \beta_d \cdot x^d + \epsilon$$

Or in matrix form for $n$ observations:
$$\mathbf{y} = \mathbf{\Phi} \boldsymbol{\beta} + \boldsymbol{\epsilon}$$

where $\mathbf{\Phi}$ is the Vandermonde matrix:

$$\mathbf{\Phi} = \begin{bmatrix} 1 & x_1 & x_1^2 & \cdots & x_1^d \\ 1 & x_2 & x_2^2 & \cdots & x_2^d \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_n & x_n^2 & \cdots & x_n^d \end{bmatrix}$$
```python
import numpy as np
from typing import Tuple


def polynomial_features_1d(x: np.ndarray, degree: int) -> np.ndarray:
    """
    Generate polynomial features for a single variable.

    Parameters:
        x: Input array of shape (n_samples,) or (n_samples, 1)
        degree: Maximum polynomial degree

    Returns:
        Feature matrix of shape (n_samples, degree + 1).
        Columns are [1, x, x^2, ..., x^degree].
    """
    x = np.asarray(x).flatten()

    # Build the Vandermonde matrix:
    # each column is x raised to the corresponding power
    Phi = np.column_stack([x**k for k in range(degree + 1)])
    return Phi


def polynomial_regression_fit(x: np.ndarray, y: np.ndarray,
                              degree: int) -> Tuple[np.ndarray, np.ndarray]:
    """
    Fit polynomial regression by least squares.

    Parameters:
        x: Input features (n_samples,)
        y: Target values (n_samples,)
        degree: Polynomial degree

    Returns:
        beta: Fitted coefficients (degree + 1,)
        Phi: Design matrix (n_samples, degree + 1)
    """
    # Generate polynomial features
    Phi = polynomial_features_1d(x, degree)

    # Normal equations would give beta = (Phi^T @ Phi)^{-1} @ Phi^T @ y;
    # we use numpy's lstsq for numerical stability
    beta, residuals, rank, s = np.linalg.lstsq(Phi, y, rcond=None)
    return beta, Phi


# Example: Fitting a quadratic to noisy data
np.random.seed(42)
x = np.linspace(-3, 3, 50)
y_true = 2 - 0.5*x + 0.8*x**2                        # True quadratic
y = y_true + np.random.normal(0, 0.5, size=len(x))   # Add noise

# Fit polynomial of degree 2
beta, Phi = polynomial_regression_fit(x, y, degree=2)
print(f"Fitted coefficients: β₀={beta[0]:.3f}, β₁={beta[1]:.3f}, β₂={beta[2]:.3f}")
print(f"True coefficients:   β₀=2.000, β₁=-0.500, β₂=0.800")
```

The Vandermonde matrix is named after Alexandre-Théophile Vandermonde. It appears in polynomial interpolation, discrete Fourier transforms, and numerous other applications. In the square case ($n = d + 1$), its determinant has a beautiful closed form, $\det(\mathbf{V}) = \prod_{i<j}(x_j - x_i)$, which is non-zero whenever all $x_i$ are distinct.
Extension to Multiple Variables:

When we have $p$ input features $\mathbf{x} = (x_1, x_2, \ldots, x_p)$, polynomial feature expansion becomes more complex because we must consider interaction terms (cross-products between different variables).

For degree $d$ and $p$ variables, we include all monomials of the form:
$$x_1^{k_1} \cdot x_2^{k_2} \cdot \ldots \cdot x_p^{k_p}$$

where $k_1 + k_2 + \ldots + k_p \leq d$ and each $k_i \geq 0$.

Example: Two Variables, Degree 2:

For $\mathbf{x} = (x_1, x_2)$ with $d=2$, the polynomial features are:
$$\phi(\mathbf{x}) = [1, x_1, x_2, x_1^2, x_1 x_2, x_2^2]^T$$

- $1$: bias term (degree 0)
- $x_1, x_2$: linear terms (degree 1)
- $x_1^2, x_2^2$: pure quadratic terms (degree 2)
- $x_1 x_2$: interaction term (degree 2, captures joint effect)
The table below shows how the total number of polynomial features (including the intercept) grows with the number of input variables $p$ and the polynomial degree $d$:

| Variables (p) | Degree 2 | Degree 3 | Degree 4 | Degree 5 |
|---|---|---|---|---|
| 1 | 3 | 4 | 5 | 6 |
| 2 | 6 | 10 | 15 | 21 |
| 3 | 10 | 20 | 35 | 56 |
| 5 | 21 | 56 | 126 | 252 |
| 10 | 66 | 286 | 1,001 | 3,003 |
| 20 | 231 | 1,771 | 10,626 | 53,130 |
The Combinatorial Formula:

The number of polynomial features for $p$ variables up to degree $d$ (including the intercept) is:
$$N_{\text{features}} = \binom{p + d}{d} = \frac{(p + d)!}{p! \cdot d!}$$

This grows rapidly with both the number of variables and the polynomial degree—a phenomenon known as the curse of dimensionality in the polynomial feature space.
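As a quick sanity check, the short sketch below reproduces the counts in the table above directly from this formula, using only Python's standard-library `math.comb`.

```python
from math import comb

# Number of polynomial features (including the intercept) for p variables
# and maximum degree d is C(p + d, d).
for p in (1, 2, 3, 5, 10, 20):
    counts = [comb(p + d, d) for d in (2, 3, 4, 5)]
    print(f"p = {p:2d}: {counts}")
# e.g. p = 20 -> [231, 1771, 10626, 53130], matching the table
```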
With 20 input variables and degree 5, you generate over 53,000 polynomial features! This combinatorial growth means polynomial features work best for low-dimensional problems or when combined with regularization. For high-dimensional data, consider kernel methods or neural networks instead.
```python
import numpy as np
from itertools import combinations_with_replacement


def polynomial_features_multi(X: np.ndarray, degree: int,
                              include_bias: bool = True) -> np.ndarray:
    """
    Generate polynomial features for multiple variables.

    Parameters:
        X: Input array of shape (n_samples, n_features)
        degree: Maximum polynomial degree
        include_bias: Whether to include the bias (all-ones) column

    Returns:
        Feature matrix with all polynomial terms up to degree d
    """
    n_samples, n_features = X.shape

    # Generate all combinations of feature indices with repetition.
    # A multiset of d indices corresponds to one monomial of total degree d.
    feature_combinations = []
    for d in range(degree + 1):
        # All ways to choose d indices from n_features (with replacement)
        for combo in combinations_with_replacement(range(n_features), d):
            feature_combinations.append(combo)

    # Build the feature matrix
    n_output_features = len(feature_combinations)
    Phi = np.ones((n_samples, n_output_features))

    for i, combo in enumerate(feature_combinations):
        for feature_idx in combo:
            Phi[:, i] *= X[:, feature_idx]

    if not include_bias:
        # Remove the first column (all ones)
        Phi = Phi[:, 1:]

    return Phi


# Example: 2 variables, degree 2
X = np.array([[1, 2], [3, 4], [5, 6]])

Phi = polynomial_features_multi(X, degree=2)
print("Original features X:")
print(X)
print("\nPolynomial features (degree 2):")
print("Columns: [1, x1, x2, x1², x1*x2, x2²]")
print(Phi)
```

Fitting Curves as Fitting Hyperplanes:

Polynomial regression transforms the problem of fitting a curve in the original space into fitting a hyperplane in a higher-dimensional feature space.

Example: Quadratic in 1D → Linear in 2D

Consider fitting $y = \beta_0 + \beta_1 x + \beta_2 x^2$ to data. In the original $(x, y)$ space, we're fitting a parabola. But if we map each point $(x_i, y_i)$ to $(z_{1i}, z_{2i}, y_i) = (x_i, x_i^2, y_i)$, we're now fitting a plane in 3D:
$$y = \beta_0 + \beta_1 z_1 + \beta_2 z_2$$

This geometric insight explains:
1. Why OLS works: We're projecting $\mathbf{y}$ onto the column space of $\mathbf{\Phi}$
2. Why regularization helps: Higher-degree polynomials have more dimensions in which to wiggle
3. Why we can interpret coefficients: Each coefficient measures the effect of that feature (though interpretation gets tricky)
The Column Space Perspective:

Recall that OLS finds $\hat{\mathbf{y}} = \mathbf{\Phi} \hat{\boldsymbol{\beta}}$, the orthogonal projection of $\mathbf{y}$ onto the column space of $\mathbf{\Phi}$.

For polynomial features, the columns of $\mathbf{\Phi}$ are:
- Column 1: $\mathbf{1} = [1, 1, \ldots, 1]^T$ (constant function)
- Column 2: $\mathbf{x} = [x_1, x_2, \ldots, x_n]^T$ (linear function)
- Column 3: $\mathbf{x}^2 = [x_1^2, x_2^2, \ldots, x_n^2]^T$ (quadratic function)
- And so on...

We're finding the best linear combination of polynomial basis functions to approximate $\mathbf{y}$. The fitted curve lives in the span of $\{1, x, x^2, \ldots, x^d\}$.
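A minimal numerical check of this projection view (using the same hand-built Vandermonde construction as earlier and an arbitrary nonlinear target): after an OLS fit, the residual vector should be orthogonal to every column of $\mathbf{\Phi}$, up to floating-point error.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-1, 1, 30)
y = np.sin(2 * x) + rng.normal(0, 0.1, size=x.shape)  # arbitrary nonlinear target

# Degree-3 polynomial basis and its OLS fit
Phi = np.column_stack([x**k for k in range(4)])
beta, *_ = np.linalg.lstsq(Phi, y, rcond=None)
residuals = y - Phi @ beta

# Orthogonality of the residual to the column space: Phi^T r ≈ 0
print(Phi.T @ residuals)  # entries on the order of machine precision
```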
Numerical Stability Issues:

Polynomial features can cause severe numerical problems. Consider $x = 100$ and degree $d = 6$:
$$x^6 = 10^{12}$$

This creates massive scale differences between the columns of $\mathbf{\Phi}$, leading to:
1. Ill-conditioned matrices: $(\mathbf{\Phi}^T \mathbf{\Phi})$ becomes nearly singular
2. Floating-point overflow: Large powers exceed the representable range
3. Loss of precision: Subtraction of nearly-equal large numbers
4. Unstable solutions: Small changes in the data cause large coefficient changes
```python
import numpy as np


def demonstrate_condition_number(x: np.ndarray, degrees: list):
    """
    Show how the condition number grows with polynomial degree.
    A high condition number signals numerical instability.
    """
    print("Polynomial Degree vs. Condition Number")
    print("=" * 45)
    print(f"{'Degree':>8} | {'Condition Number':>20} | {'Status':>12}")
    print("-" * 45)

    for d in degrees:
        # Build Vandermonde matrix
        Phi = np.column_stack([x**k for k in range(d + 1)])

        # Condition number: ratio of largest to smallest singular value
        cond = np.linalg.cond(Phi)

        if cond < 1e3:
            status = "Good"
        elif cond < 1e8:
            status = "Warning"
        elif cond < 1e15:
            status = "Poor"
        else:
            status = "Singular!"

        print(f"{d:>8} | {cond:>20.2e} | {status:>12}")


# Data spread across [0, 10] - a modest range
x = np.linspace(0, 10, 100)
demonstrate_condition_number(x, degrees=[1, 2, 3, 5, 8, 10, 15, 20])
```

The standard remedies are to center and scale the inputs before building the polynomial basis and to solve the least-squares problem via QR decomposition rather than the normal equations, as in the following implementation:
```python
import numpy as np
from typing import Tuple


def stable_polynomial_fit(x: np.ndarray, y: np.ndarray,
                          degree: int) -> Tuple[np.ndarray, float, float]:
    """
    Numerically stable polynomial fitting using centering/scaling + QR.

    Parameters:
        x: Input values
        y: Target values
        degree: Polynomial degree

    Returns:
        beta_scaled: Fitted coefficients in the centered/scaled basis
        mean: Mean used for centering
        std: Std used for scaling
    """
    # Step 1: Center and scale x
    x_mean = np.mean(x)
    x_std = np.std(x)
    if x_std < 1e-10:
        x_std = 1.0  # Prevent division by zero
    z = (x - x_mean) / x_std

    # Step 2: Build Vandermonde matrix with scaled data
    Phi = np.column_stack([z**k for k in range(degree + 1)])

    # Step 3: Solve using QR decomposition (more stable than the normal equations)
    Q, R = np.linalg.qr(Phi)
    beta_scaled = np.linalg.solve(R, Q.T @ y)

    # Converting coefficients back to the original basis requires careful
    # bookkeeping, so we return the scaled coefficients plus the transformation
    # parameters and apply the same transform at prediction time.
    return beta_scaled, x_mean, x_std


def predict_stable(x_new: np.ndarray, beta: np.ndarray,
                   x_mean: float, x_std: float) -> np.ndarray:
    """
    Make predictions using the stable polynomial model.
    """
    z_new = (x_new - x_mean) / x_std
    degree = len(beta) - 1
    Phi_new = np.column_stack([z_new**k for k in range(degree + 1)])
    return Phi_new @ beta


# Demonstration
np.random.seed(42)
x = np.linspace(-100, 100, 50)  # Large range - would be unstable without scaling
y = 5 - 0.02*x + 0.001*x**2 + np.random.normal(0, 2, len(x))

beta, x_mean, x_std = stable_polynomial_fit(x, y, degree=2)
y_pred = predict_stable(x, beta, x_mean, x_std)

print(f"Scaled coefficients: {beta}")
print(f"R² score: {1 - np.sum((y - y_pred)**2) / np.sum((y - np.mean(y))**2):.6f}")
```

Polynomial features are a powerful tool, but not always the right choice. Understanding their strengths and limitations helps you choose appropriately.
Polynomials are notorious for wild behavior outside the training data range. A degree-6 polynomial fitted to data in $[0, 10]$ can blow up to enormous, physically meaningless values just beyond $x = 10$. If extrapolation is needed, consider models that extrapolate more gracefully (linear or asymptotic models), or limit the polynomial degree severely.
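The sketch below illustrates this failure mode under simple assumptions (a gently curved synthetic target and an arbitrarily chosen degree-9 fit): inside the training range the fit tracks the data, but evaluated past $x = 10$ it typically drifts far from the underlying trend.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 30)
y = np.log1p(x) + rng.normal(0, 0.05, size=x.shape)  # gently curved target

# High-degree fit on a centered/scaled copy of x for numerical stability
z = (x - x.mean()) / x.std()
Phi = np.column_stack([z**k for k in range(10)])  # degree 9
beta, *_ = np.linalg.lstsq(Phi, y, rcond=None)

# Evaluate inside and just outside the training range
for x_new in (5.0, 10.0, 12.0, 15.0):
    z_new = (x_new - x.mean()) / x.std()
    pred = np.array([z_new**k for k in range(10)]) @ beta
    print(f"x = {x_new:5.1f}  prediction = {pred:12.4f}  true trend = {np.log1p(x_new):.4f}")
# Predictions near x = 5 track log1p(x); beyond x = 10 they typically diverge.
```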
Alternatives to Consider:

| Alternative | When to Prefer Over Polynomials |
|-------------|--------------------------------|
| Kernel methods | High-dimensional, complex patterns |
| Splines | Need local control, avoid global wiggles |
| Neural networks | Very complex patterns, lots of data |
| Decision trees | Non-smooth, discrete patterns |
| Fourier features | Periodic/oscillatory patterns |
| Logarithmic/exponential | Known functional form suggests it |

Polynomial features remain excellent for their simplicity, interpretability, and computational efficiency in appropriate settings. The key is recognizing those settings.
Let's walk through a complete polynomial regression example using the classic stopping distance dataset. Physics tells us stopping distance should scale with the square of speed (kinetic energy), making this a natural candidate for quadratic regression.
```python
import numpy as np
from typing import Dict

# Classic stopping distance data (speed in mph, distance in feet)
# Subset of R's 'cars' dataset, recorded in the 1920s
speed = np.array([4, 7, 8, 9, 10, 11, 12, 13, 13, 14, 14, 14, 14, 15, 15,
                  15, 16, 16, 17, 17, 17, 18, 18, 18, 18, 19, 19, 19, 20, 20,
                  20, 20, 20, 22, 23, 24, 24, 24, 24, 25])
distance = np.array([2, 4, 16, 10, 18, 17, 14, 26, 34, 26, 36, 60, 80, 20, 26,
                     54, 32, 40, 32, 40, 50, 42, 56, 76, 84, 36, 46, 68, 32, 48,
                     52, 56, 64, 66, 54, 70, 92, 93, 120, 85])


def fit_and_evaluate(x: np.ndarray, y: np.ndarray, degrees: list) -> Dict[int, dict]:
    """
    Fit polynomial models of various degrees and compute evaluation metrics.
    """
    # Center and scale for numerical stability
    x_mean, x_std = np.mean(x), np.std(x)
    z = (x - x_mean) / x_std

    results = {}
    n = len(y)

    for d in degrees:
        # Build design matrix
        Phi = np.column_stack([z**k for k in range(d + 1)])

        # Fit using QR decomposition
        Q, R = np.linalg.qr(Phi)
        beta = np.linalg.solve(R, Q.T @ y)

        # Predictions and residuals
        y_pred = Phi @ beta
        residuals = y - y_pred

        # Metrics
        ss_res = np.sum(residuals**2)
        ss_tot = np.sum((y - np.mean(y))**2)
        r_squared = 1 - ss_res / ss_tot

        # Adjusted R² (penalizes for the number of predictors)
        p = d  # number of predictors, excluding the intercept
        adj_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - p - 1)

        # Root Mean Square Error
        rmse = np.sqrt(ss_res / n)

        # Condition number (numerical stability indicator)
        cond = np.linalg.cond(Phi)

        results[d] = {
            'beta': beta,
            'y_pred': y_pred,
            'r_squared': r_squared,
            'adj_r_squared': adj_r_squared,
            'rmse': rmse,
            'condition_number': cond,
            'x_mean': x_mean,
            'x_std': x_std
        }

    return results


# Fit models of degree 1, 2, 3, 5
results = fit_and_evaluate(speed, distance, degrees=[1, 2, 3, 5])

# Print comparison
print("Polynomial Model Comparison: Stopping Distance")
print("=" * 65)
print(f"{'Degree':>8} | {'R²':>10} | {'Adj R²':>10} | {'RMSE':>10} | {'Cond #':>12}")
print("-" * 65)
for d, r in results.items():
    print(f"{d:>8} | {r['r_squared']:>10.4f} | {r['adj_r_squared']:>10.4f} | "
          f"{r['rmse']:>10.2f} | {r['condition_number']:>12.2e}")

print("\nPhysics interpretation:")
print("- Linear (d=1): Ignores the kinetic energy relationship")
print("- Quadratic (d=2): Matches the physics (KE ~ v²)")
print("- Higher degrees: Fit noise, generalize worse")
```

In this example, physics tells us stopping distance ≈ reaction distance (linear in speed) + braking distance (quadratic in speed). The quadratic model has the best adjusted R² because it captures the true relationship without overfitting. Domain knowledge guides optimal degree selection!
We've established the mathematical and computational foundation for polynomial regression. Let's consolidate the key concepts:

- Polynomial regression is nonlinear in the inputs but linear in the parameters, so all of the linear regression machinery (OLS, normal equations, gradient descent) still applies.
- For a single variable of degree $d$, the design matrix is a Vandermonde matrix with columns $1, x, x^2, \ldots, x^d$.
- With multiple variables, the expansion includes interaction terms, and the number of features grows combinatorially as $\binom{p+d}{d}$.
- Raw polynomial features are numerically fragile; centering and scaling the inputs and solving with QR decomposition keep the fit stable.
- Polynomials extrapolate poorly and are best suited to low-dimensional problems; splines, kernel methods, or other models may fit better elsewhere.
What's Next:

Now that we understand how to construct polynomial features, the crucial question becomes: How do we choose the right polynomial degree? Too low misses the pattern; too high overfits the noise. The next page develops rigorous methods for degree selection, including cross-validation, information criteria, and the bias-variance perspective on polynomial complexity.
You now understand the mathematical framework of polynomial features: how they extend linear regression to capture nonlinear patterns while preserving computational tractability. Next, we'll tackle the critical challenge of selecting the optimal polynomial degree.