Linear regression is one of the most powerful tools in machine learning, but it comes with a fundamental limitation: it assumes the relationship between features and target is a straight line. In reality, most phenomena in the physical world, economics, biology, and engineering exhibit curved, nonlinear patterns.

Consider the following examples:
- Physics: Projectile motion follows a parabolic trajectory (quadratic in time)
- Economics: Diminishing returns create logarithmic or polynomial relationships
- Biology: Population growth often follows sigmoidal or polynomial patterns
- Engineering: Stress-strain relationships in materials exhibit nonlinear behavior

Polynomial regression elegantly bridges this gap by extending linear regression to model nonlinear relationships—while keeping all the computational conveniences of linear algebra intact.
By the end of this page, you will understand:
1. The mathematical formulation of polynomial features
2. How polynomial regression remains 'linear' in a specific mathematical sense
3. The complete feature transformation process
4. Computational considerations and the design matrix structure
5. When polynomial features are appropriate vs. other nonlinear methods
The key insight that makes polynomial regression work is the distinction between linearity in features and linearity in parameters. When we say linear regression is 'linear,' we mean the model output is a linear combination of the parameters—not necessarily a linear function of the input.

Standard Linear Regression:
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_p x_p + \epsilon$$

This model is linear in both the parameters $(\beta_0, \beta_1, \ldots, \beta_p)$ and the features $(x_1, x_2, \ldots, x_p)$.

Polynomial Regression (degree 2):
$$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \epsilon$$

This model is nonlinear in the input $x$, but critically, it remains linear in the parameters $(\beta_0, \beta_1, \beta_2)$. This linearity in parameters is what matters for applying all the machinery of linear regression.
If we define new features $z_1 = x$ and $z_2 = x^2$, the polynomial model becomes:
$$y = \beta_0 + \beta_1 z_1 + \beta_2 z_2 + \epsilon$$
This is just ordinary linear regression in the transformed feature space! All the tools we developed for linear regression—OLS, gradient descent, closed-form solutions—work without modification.
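To see this concretely, here is a minimal sketch (synthetic data with arbitrarily chosen coefficients and noise level): we build the transformed features $z_1 = x$ and $z_2 = x^2$ explicitly and hand them to an ordinary least-squares solver.

```python
import numpy as np

# Synthetic data from a known quadratic (coefficients chosen for illustration)
rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=200)
y = 1.0 + 2.0 * x - 3.0 * x**2 + rng.normal(0, 0.1, size=x.shape)

# Transformed feature space: intercept column, z1 = x, z2 = x^2
Z = np.column_stack([np.ones_like(x), x, x**2])

# Plain OLS on the transformed features recovers the quadratic's coefficients
beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
print(beta)  # approximately [1.0, 2.0, -3.0]
```

Nothing about the solver changes; only the design matrix does.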
Why This Matters Mathematically:

The normal equations for OLS require the model to be linear in parameters to yield a closed-form solution:
$$\hat{\boldsymbol{\beta}} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}$$

For this to work, the design matrix $\mathbf{X}$ can contain any transformations of the original inputs—squares, cubes, cross products—as long as the model remains:
$$\mathbf{y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\epsilon}$$

where $\boldsymbol{\beta}$ appears linearly. This is the foundation of all feature engineering approaches that extend linear models.
Single Variable Case:

Given a single input variable $x$ and a polynomial of degree $d$, the polynomial feature expansion creates $d+1$ features (including the intercept):

$$\phi(x) = [1, x, x^2, x^3, \ldots, x^d]^T$$

The polynomial regression model becomes:
$$y = \beta_0 \cdot 1 + \beta_1 \cdot x + \beta_2 \cdot x^2 + \ldots + \beta_d \cdot x^d + \epsilon$$

Or in matrix form for $n$ observations:
$$\mathbf{y} = \mathbf{\Phi} \boldsymbol{\beta} + \boldsymbol{\epsilon}$$

where $\mathbf{\Phi}$ is the Vandermonde matrix:

$$\mathbf{\Phi} = \begin{bmatrix} 1 & x_1 & x_1^2 & \cdots & x_1^d \\ 1 & x_2 & x_2^2 & \cdots & x_2^d \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_n & x_n^2 & \cdots & x_n^d \end{bmatrix}$$
```python
import numpy as np
from typing import Tuple


def polynomial_features_1d(x: np.ndarray, degree: int) -> np.ndarray:
    """
    Generate polynomial features for a single variable.

    Parameters:
        x: Input array of shape (n_samples,) or (n_samples, 1)
        degree: Maximum polynomial degree

    Returns:
        Feature matrix of shape (n_samples, degree + 1).
        Columns are [1, x, x^2, ..., x^degree].
    """
    x = np.asarray(x).flatten()

    # Build the Vandermonde matrix:
    # each column is x raised to the corresponding power
    Phi = np.column_stack([x**k for k in range(degree + 1)])
    return Phi


def polynomial_regression_fit(x: np.ndarray, y: np.ndarray,
                              degree: int) -> Tuple[np.ndarray, np.ndarray]:
    """
    Fit polynomial regression by least squares.

    Parameters:
        x: Input features (n_samples,)
        y: Target values (n_samples,)
        degree: Polynomial degree

    Returns:
        beta: Fitted coefficients (degree + 1,)
        Phi: Design matrix (n_samples, degree + 1)
    """
    # Generate polynomial features
    Phi = polynomial_features_1d(x, degree)

    # Normal equations would give beta = (Phi^T @ Phi)^{-1} @ Phi^T @ y;
    # we use numpy's lstsq for numerical stability
    beta, residuals, rank, s = np.linalg.lstsq(Phi, y, rcond=None)
    return beta, Phi


# Example: Fitting a quadratic to noisy data
np.random.seed(42)
x = np.linspace(-3, 3, 50)
y_true = 2 - 0.5*x + 0.8*x**2                        # True quadratic
y = y_true + np.random.normal(0, 0.5, size=len(x))   # Add noise

# Fit polynomial of degree 2
beta, Phi = polynomial_regression_fit(x, y, degree=2)
print(f"Fitted coefficients: β₀={beta[0]:.3f}, β₁={beta[1]:.3f}, β₂={beta[2]:.3f}")
print(f"True coefficients:   β₀=2.000, β₁=-0.500, β₂=0.800")
```

The Vandermonde matrix is named after Alexandre-Théophile Vandermonde. It appears in polynomial interpolation, discrete Fourier transforms, and numerous other applications. In the square case ($n = d + 1$), its determinant has a beautiful closed form, $\det(\mathbf{V}) = \prod_{i<j}(x_j - x_i)$, which is non-zero whenever all $x_i$ are distinct.
Extension to Multiple Variables:

When we have $p$ input features $\mathbf{x} = (x_1, x_2, \ldots, x_p)$, polynomial feature expansion becomes more complex because we must consider interaction terms (cross-products between different variables).

For degree $d$ and $p$ variables, we include all monomials of the form:
$$x_1^{k_1} \cdot x_2^{k_2} \cdot \ldots \cdot x_p^{k_p}$$

where $k_1 + k_2 + \ldots + k_p \leq d$ and each $k_i \geq 0$.

Example: Two Variables, Degree 2:

For $\mathbf{x} = (x_1, x_2)$ with $d=2$, the polynomial features are:
$$\phi(\mathbf{x}) = [1, x_1, x_2, x_1^2, x_1 x_2, x_2^2]^T$$

- $1$: bias term (degree 0)
- $x_1, x_2$: linear terms (degree 1)
- $x_1^2, x_2^2$: pure quadratic terms (degree 2)
- $x_1 x_2$: interaction term (degree 2, captures joint effect)
The table below shows how the total number of polynomial features (including the intercept) grows with the number of input variables $p$ and the polynomial degree $d$:

| Variables (p) | Degree 2 | Degree 3 | Degree 4 | Degree 5 |
|---|---|---|---|---|
| 1 | 3 | 4 | 5 | 6 |
| 2 | 6 | 10 | 15 | 21 |
| 3 | 10 | 20 | 35 | 56 |
| 5 | 21 | 56 | 126 | 252 |
| 10 | 66 | 286 | 1,001 | 3,003 |
| 20 | 231 | 1,771 | 10,626 | 53,130 |
The Combinatorial Formula:

The number of polynomial features for $p$ variables up to degree $d$ (including the intercept) is:
$$N_{\text{features}} = \binom{p + d}{d} = \frac{(p + d)!}{p! \cdot d!}$$

This grows rapidly with both the number of variables and the polynomial degree—a phenomenon known as the curse of dimensionality in the polynomial feature space.
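As a quick sanity check, the short sketch below reproduces the counts in the table above directly from this formula, using only Python's standard-library `math.comb`.

```python
from math import comb

# Number of polynomial features (including the intercept) for p variables
# and maximum degree d is C(p + d, d).
for p in (1, 2, 3, 5, 10, 20):
    counts = [comb(p + d, d) for d in (2, 3, 4, 5)]
    print(f"p = {p:2d}: {counts}")
# e.g. p = 20 -> [231, 1771, 10626, 53130], matching the table
```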
With 20 input variables and degree 5, you generate over 53,000 polynomial features! This combinatorial growth means polynomial features work best for low-dimensional problems or when combined with regularization. For high-dimensional data, consider kernel methods or neural networks instead.
```python
import numpy as np
from itertools import combinations_with_replacement


def polynomial_features_multi(X: np.ndarray, degree: int,
                              include_bias: bool = True) -> np.ndarray:
    """
    Generate polynomial features for multiple variables.

    Parameters:
        X: Input array of shape (n_samples, n_features)
        degree: Maximum polynomial degree
        include_bias: Whether to include the bias (all-ones) column

    Returns:
        Feature matrix with all polynomial terms up to degree d
    """
    n_samples, n_features = X.shape

    # Generate all combinations of feature indices with repetition.
    # A multiset of d indices corresponds to one monomial of total degree d.
    feature_combinations = []
    for d in range(degree + 1):
        # All ways to choose d indices from n_features (with replacement)
        for combo in combinations_with_replacement(range(n_features), d):
            feature_combinations.append(combo)

    # Build the feature matrix
    n_output_features = len(feature_combinations)
    Phi = np.ones((n_samples, n_output_features))

    for i, combo in enumerate(feature_combinations):
        for feature_idx in combo:
            Phi[:, i] *= X[:, feature_idx]

    if not include_bias:
        # Remove the first column (all ones)
        Phi = Phi[:, 1:]

    return Phi


# Example: 2 variables, degree 2
X = np.array([[1, 2], [3, 4], [5, 6]])

Phi = polynomial_features_multi(X, degree=2)
print("Original features X:")
print(X)
print("\nPolynomial features (degree 2):")
print("Columns: [1, x1, x2, x1², x1*x2, x2²]")
print(Phi)
```

Fitting Curves as Fitting Hyperplanes:

Polynomial regression transforms the problem of fitting a curve in the original space into fitting a hyperplane in a higher-dimensional feature space.

Example: Quadratic in 1D → Linear in 2D

Consider fitting $y = \beta_0 + \beta_1 x + \beta_2 x^2$ to data. In the original $(x, y)$ space, we're fitting a parabola. But if we map each point $(x_i, y_i)$ to $(z_{1i}, z_{2i}, y_i) = (x_i, x_i^2, y_i)$, we're now fitting a plane in 3D:
$$y = \beta_0 + \beta_1 z_1 + \beta_2 z_2$$

This geometric insight explains:
1. Why OLS works: We're projecting $\mathbf{y}$ onto the column space of $\mathbf{\Phi}$
2. Why regularization helps: Higher-degree polynomials have more dimensions in which to wiggle
3. Why we can interpret coefficients: Each coefficient measures the effect of that feature (though interpretation gets tricky)
The Column Space Perspective:

Recall that OLS finds $\hat{\mathbf{y}} = \mathbf{\Phi} \hat{\boldsymbol{\beta}}$, the orthogonal projection of $\mathbf{y}$ onto the column space of $\mathbf{\Phi}$.

For polynomial features, the columns of $\mathbf{\Phi}$ are:
- Column 1: $\mathbf{1} = [1, 1, \ldots, 1]^T$ (constant function)
- Column 2: $\mathbf{x} = [x_1, x_2, \ldots, x_n]^T$ (linear function)
- Column 3: $\mathbf{x}^2 = [x_1^2, x_2^2, \ldots, x_n^2]^T$ (quadratic function)
- And so on...

We're finding the best linear combination of polynomial basis functions to approximate $\mathbf{y}$. The fitted curve lives in the span of $\{1, x, x^2, \ldots, x^d\}$.
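A minimal numerical check of this projection view (using the same hand-built Vandermonde construction as earlier and an arbitrary nonlinear target): after an OLS fit, the residual vector should be orthogonal to every column of $\mathbf{\Phi}$, up to floating-point error.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-1, 1, 30)
y = np.sin(2 * x) + rng.normal(0, 0.1, size=x.shape)  # arbitrary nonlinear target

# Degree-3 polynomial basis and its OLS fit
Phi = np.column_stack([x**k for k in range(4)])
beta, *_ = np.linalg.lstsq(Phi, y, rcond=None)
residuals = y - Phi @ beta

# Orthogonality of the residual to the column space: Phi^T r ≈ 0
print(Phi.T @ residuals)  # entries on the order of machine precision
```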
Numerical Stability Issues:

Polynomial features can cause severe numerical problems. Consider $x = 100$ and degree $d = 6$:
$$x^6 = 10^{12}$$

This creates massive scale differences between the columns of $\mathbf{\Phi}$, leading to:
1. Ill-conditioned matrices: $(\mathbf{\Phi}^T \mathbf{\Phi})$ becomes nearly singular
2. Floating-point overflow: Large powers exceed the representable range
3. Loss of precision: Subtraction of nearly-equal large numbers
4. Unstable solutions: Small changes in the data cause large coefficient changes
```python
import numpy as np


def demonstrate_condition_number(x: np.ndarray, degrees: list):
    """
    Show how the condition number grows with polynomial degree.
    A high condition number signals numerical instability.
    """
    print("Polynomial Degree vs. Condition Number")
    print("=" * 45)
    print(f"{'Degree':>8} | {'Condition Number':>20} | {'Status':>12}")
    print("-" * 45)

    for d in degrees:
        # Build Vandermonde matrix
        Phi = np.column_stack([x**k for k in range(d + 1)])

        # Condition number: ratio of largest to smallest singular value
        cond = np.linalg.cond(Phi)

        if cond < 1e3:
            status = "Good"
        elif cond < 1e8:
            status = "Warning"
        elif cond < 1e15:
            status = "Poor"
        else:
            status = "Singular!"

        print(f"{d:>8} | {cond:>20.2e} | {status:>12}")


# Data spread across [0, 10] - a modest range
x = np.linspace(0, 10, 100)
demonstrate_condition_number(x, degrees=[1, 2, 3, 5, 8, 10, 15, 20])
```

The standard remedies are to center and scale the inputs before building the polynomial basis and to solve the least-squares problem via QR decomposition rather than the normal equations, as in the following implementation:
```python
import numpy as np
from typing import Tuple


def stable_polynomial_fit(x: np.ndarray, y: np.ndarray,
                          degree: int) -> Tuple[np.ndarray, float, float]:
    """
    Numerically stable polynomial fitting using centering/scaling + QR.

    Parameters:
        x: Input values
        y: Target values
        degree: Polynomial degree

    Returns:
        beta_scaled: Fitted coefficients in the centered/scaled basis
        mean: Mean used for centering
        std: Std used for scaling
    """
    # Step 1: Center and scale x
    x_mean = np.mean(x)
    x_std = np.std(x)
    if x_std < 1e-10:
        x_std = 1.0  # Prevent division by zero
    z = (x - x_mean) / x_std

    # Step 2: Build Vandermonde matrix with scaled data
    Phi = np.column_stack([z**k for k in range(degree + 1)])

    # Step 3: Solve using QR decomposition (more stable than the normal equations)
    Q, R = np.linalg.qr(Phi)
    beta_scaled = np.linalg.solve(R, Q.T @ y)

    # Converting coefficients back to the original basis requires careful
    # bookkeeping, so we return the scaled coefficients plus the transformation
    # parameters and apply the same transform at prediction time.
    return beta_scaled, x_mean, x_std


def predict_stable(x_new: np.ndarray, beta: np.ndarray,
                   x_mean: float, x_std: float) -> np.ndarray:
    """
    Make predictions using the stable polynomial model.
    """
    z_new = (x_new - x_mean) / x_std
    degree = len(beta) - 1
    Phi_new = np.column_stack([z_new**k for k in range(degree + 1)])
    return Phi_new @ beta


# Demonstration
np.random.seed(42)
x = np.linspace(-100, 100, 50)  # Large range - would be unstable without scaling
y = 5 - 0.02*x + 0.001*x**2 + np.random.normal(0, 2, len(x))

beta, x_mean, x_std = stable_polynomial_fit(x, y, degree=2)
y_pred = predict_stable(x, beta, x_mean, x_std)

print(f"Scaled coefficients: {beta}")
print(f"R² score: {1 - np.sum((y - y_pred)**2) / np.sum((y - np.mean(y))**2):.6f}")
```

Polynomial features are a powerful tool, but not always the right choice. Understanding their strengths and limitations helps you choose appropriately.
Polynomials are notorious for wild behavior outside the training data range. A degree-6 polynomial fitted to data in $[0, 10]$ can blow up to enormous, physically meaningless values just beyond $x = 10$. If extrapolation is needed, consider models that extrapolate more gracefully (linear or asymptotic models), or limit the polynomial degree severely.
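The sketch below illustrates this failure mode under simple assumptions (a gently curved synthetic target and an arbitrarily chosen degree-9 fit): inside the training range the fit tracks the data, but evaluated past $x = 10$ it typically drifts far from the underlying trend.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 30)
y = np.log1p(x) + rng.normal(0, 0.05, size=x.shape)  # gently curved target

# High-degree fit on a centered/scaled copy of x for numerical stability
z = (x - x.mean()) / x.std()
Phi = np.column_stack([z**k for k in range(10)])  # degree 9
beta, *_ = np.linalg.lstsq(Phi, y, rcond=None)

# Evaluate inside and just outside the training range
for x_new in (5.0, 10.0, 12.0, 15.0):
    z_new = (x_new - x.mean()) / x.std()
    pred = np.array([z_new**k for k in range(10)]) @ beta
    print(f"x = {x_new:5.1f}  prediction = {pred:12.4f}  true trend = {np.log1p(x_new):.4f}")
# Predictions near x = 5 track log1p(x); beyond x = 10 they typically diverge.
```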
Alternatives to Consider:

| Alternative | When to Prefer Over Polynomials |
|-------------|--------------------------------|
| Kernel methods | High-dimensional, complex patterns |
| Splines | Need local control, avoid global wiggles |
| Neural networks | Very complex patterns, lots of data |
| Decision trees | Non-smooth, discrete patterns |
| Fourier features | Periodic/oscillatory patterns |
| Logarithmic/exponential | Known functional form suggests it |

Polynomial features remain excellent for their simplicity, interpretability, and computational efficiency in appropriate settings. The key is recognizing those settings.
Let's walk through a complete polynomial regression example using the classic stopping distance dataset. Physics tells us stopping distance should scale with the square of speed (kinetic energy), making this a natural candidate for quadratic regression.
```python
import numpy as np
from typing import Dict

# Classic stopping distance data (speed in mph, distance in feet)
# Subset of R's 'cars' dataset, recorded in the 1920s
speed = np.array([4, 7, 8, 9, 10, 11, 12, 13, 13, 14, 14, 14, 14, 15, 15,
                  15, 16, 16, 17, 17, 17, 18, 18, 18, 18, 19, 19, 19, 20, 20,
                  20, 20, 20, 22, 23, 24, 24, 24, 24, 25])
distance = np.array([2, 4, 16, 10, 18, 17, 14, 26, 34, 26, 36, 60, 80, 20, 26,
                     54, 32, 40, 32, 40, 50, 42, 56, 76, 84, 36, 46, 68, 32, 48,
                     52, 56, 64, 66, 54, 70, 92, 93, 120, 85])


def fit_and_evaluate(x: np.ndarray, y: np.ndarray, degrees: list) -> Dict[int, dict]:
    """
    Fit polynomial models of various degrees and compute evaluation metrics.
    """
    # Center and scale for numerical stability
    x_mean, x_std = np.mean(x), np.std(x)
    z = (x - x_mean) / x_std

    results = {}
    n = len(y)

    for d in degrees:
        # Build design matrix
        Phi = np.column_stack([z**k for k in range(d + 1)])

        # Fit using QR decomposition
        Q, R = np.linalg.qr(Phi)
        beta = np.linalg.solve(R, Q.T @ y)

        # Predictions and residuals
        y_pred = Phi @ beta
        residuals = y - y_pred

        # Metrics
        ss_res = np.sum(residuals**2)
        ss_tot = np.sum((y - np.mean(y))**2)
        r_squared = 1 - ss_res / ss_tot

        # Adjusted R² (penalizes for the number of predictors)
        p = d  # number of predictors, excluding the intercept
        adj_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - p - 1)

        # Root Mean Square Error
        rmse = np.sqrt(ss_res / n)

        # Condition number (numerical stability indicator)
        cond = np.linalg.cond(Phi)

        results[d] = {
            'beta': beta,
            'y_pred': y_pred,
            'r_squared': r_squared,
            'adj_r_squared': adj_r_squared,
            'rmse': rmse,
            'condition_number': cond,
            'x_mean': x_mean,
            'x_std': x_std
        }

    return results


# Fit models of degree 1, 2, 3, 5
results = fit_and_evaluate(speed, distance, degrees=[1, 2, 3, 5])

# Print comparison
print("Polynomial Model Comparison: Stopping Distance")
print("=" * 65)
print(f"{'Degree':>8} | {'R²':>10} | {'Adj R²':>10} | {'RMSE':>10} | {'Cond #':>12}")
print("-" * 65)
for d, r in results.items():
    print(f"{d:>8} | {r['r_squared']:>10.4f} | {r['adj_r_squared']:>10.4f} | "
          f"{r['rmse']:>10.2f} | {r['condition_number']:>12.2e}")

print("\nPhysics interpretation:")
print("- Linear (d=1): Ignores the kinetic energy relationship")
print("- Quadratic (d=2): Matches the physics (KE ~ v²)")
print("- Higher degrees: Fit noise, generalize worse")
```

In this example, physics tells us stopping distance ≈ reaction distance (linear in speed) + braking distance (quadratic in speed). The quadratic model has the best adjusted R² because it captures the true relationship without overfitting. Domain knowledge guides optimal degree selection!
We've established the mathematical and computational foundation for polynomial regression. Let's consolidate the key concepts:

- Polynomial regression is nonlinear in the inputs but linear in the parameters, so all of the linear regression machinery (OLS, normal equations, gradient descent) still applies.
- For a single variable of degree $d$, the design matrix is a Vandermonde matrix with columns $1, x, x^2, \ldots, x^d$.
- With multiple variables, the expansion includes interaction terms, and the number of features grows combinatorially as $\binom{p+d}{d}$.
- Raw polynomial features are numerically fragile; centering and scaling the inputs and solving with QR decomposition keep the fit stable.
- Polynomials extrapolate poorly and are best suited to low-dimensional problems; splines, kernel methods, or other models may fit better elsewhere.
What's Next:

Now that we understand how to construct polynomial features, the crucial question becomes: How do we choose the right polynomial degree? Too low misses the pattern; too high overfits the noise. The next page develops rigorous methods for degree selection, including cross-validation, information criteria, and the bias-variance perspective on polynomial complexity.
You now understand the mathematical framework of polynomial features: how they extend linear regression to capture nonlinear patterns while preserving computational tractability. Next, we'll tackle the critical challenge of selecting the optimal polynomial degree.