Every data scientist has seen it: a polynomial that fits the training data perfectly—passing through every single point—yet fails catastrophically on new data. This is overfitting, and polynomial regression is particularly susceptible to it.
The power that allows polynomials to capture complex patterns also enables them to memorize noise. A degree-$n$ polynomial can perfectly interpolate $n+1$ points, regardless of whether those points represent signal or noise. This flexibility is both a blessing and a curse.
Understanding, detecting, and mitigating overfitting is essential for any practitioner using polynomial models. This page provides a comprehensive treatment of the overfitting problem in polynomial regression.
By the end of this page, you will understand: (1) Why polynomial regression is prone to overfitting; (2) Mathematical characterization of overfitting through generalization error; (3) Practical detection methods including learning curves and residual analysis; (4) Regularization as a powerful mitigation strategy; (5) The double descent phenomenon in high-degree polynomials.
What Is Overfitting?
Overfitting occurs when a model learns patterns in the training data that don't generalize to new data. Instead of learning the underlying relationship $f(x)$, the model memorizes the specific training examples, including their noise.
The Mathematical Signature:
Let $\text{Err}_{\text{train}}$ be training error and $\text{Err}_{\text{test}}$ be test error (on unseen data). Overfitting manifests as: $$\text{Err}_{\text{train}} \ll \text{Err}_{\text{test}}$$
The gap between training and test error is the generalization gap—a direct measure of overfitting severity.
| Indicator | Underfitting | Good Fit | Overfitting |
|---|---|---|---|
| Training Error | High | Low-moderate | Very low (near zero) |
| Test Error | High | Low-moderate | High |
| Generalization Gap | Small | Small | Large |
| Coefficient Magnitudes | Small | Moderate | Very large |
| Prediction Variability | Low | Moderate | High |
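The indicators in the table can be reproduced in a tiny synthetic experiment. The sketch below (illustrative data and degree choices, not from the later demonstration) interpolates 10 noisy points with a degree-9 polynomial and measures the resulting train/test gap:

```python
import numpy as np

rng = np.random.default_rng(0)

# 10 noisy training points, 200 noisy test points from the same process
x_train = np.linspace(0, 1, 10)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.3, 10)
x_test = np.linspace(0, 1, 200)
y_test = np.sin(2 * np.pi * x_test) + rng.normal(0, 0.3, 200)

# A degree-9 polynomial interpolates all 10 training points exactly
coeffs = np.polyfit(x_train, y_train, 9)
err_train = np.mean((y_train - np.polyval(coeffs, x_train))**2)
err_test = np.mean((y_test - np.polyval(coeffs, x_test))**2)

print(f"train MSE: {err_train:.2e}")
print(f"test MSE:  {err_test:.2e}")
print(f"gap:       {err_test - err_train:.2e}")
```

The training error is essentially zero while the test error cannot fall below the noise floor, so the generalization gap is enormous: exactly the "very low train, high test" signature from the table.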
Why Polynomials Are Particularly Susceptible:
High capacity growth: Adding one degree adds one parameter, but expands representational power dramatically
Runge's phenomenon: Even for smooth functions, high-degree polynomial interpolation can oscillate wildly between data points
Leveraging outliers: High-degree terms amplify the influence of extreme points
Boundary effects: Polynomials can diverge rapidly outside the data range, creating unrealistic predictions
Numerical instability: The Vandermonde matrix becomes ill-conditioned, making coefficients sensitive to small data changes
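Runge's phenomenon is easy to reproduce. The sketch below (an illustrative example using the classic Runge function $1/(1+25x^2)$) interpolates at 11 equispaced nodes: the polynomial passes through every node exactly, yet oscillates far from the true curve near the boundaries:

```python
import numpy as np

# Classic Runge function on [-1, 1]
f = lambda x: 1.0 / (1.0 + 25.0 * x**2)

# Degree-10 polynomial through 11 equispaced nodes (exact interpolation)
x_nodes = np.linspace(-1, 1, 11)
coeffs = np.polyfit(x_nodes, f(x_nodes), 10)

# Error is ~0 at the nodes, but blows up between nodes near the edges
x_dense = np.linspace(-1, 1, 1001)
node_err = np.max(np.abs(np.polyval(coeffs, x_nodes) - f(x_nodes)))
max_err = np.max(np.abs(np.polyval(coeffs, x_dense) - f(x_dense)))

print(f"max error at nodes:   {node_err:.2e}")
print(f"max error on [-1, 1]: {max_err:.2f}")
```

Even though the target function is perfectly smooth and noise-free, the between-node error exceeds the function's entire range: high-degree interpolation can fail badly even without noise.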
A degree $(n-1)$ polynomial can perfectly interpolate $n$ points, achieving zero training error. But this is a trap! Perfect training fit usually means the model is fitting noise, not signal. If your polynomial achieves R² = 1.000, be very suspicious.
The most effective way to understand overfitting is to see it. Let's generate synthetic data and fit polynomials of increasing degree.
```python
import numpy as np
import matplotlib.pyplot as plt

def demonstrate_overfitting(n_train: int = 15,
                            n_test: int = 100,
                            noise_std: float = 1.0,
                            degrees: list = [1, 3, 5, 12]):
    """
    Demonstrate overfitting with polynomials of increasing degree.
    """
    np.random.seed(42)

    # True underlying function (cubic)
    true_func = lambda x: 0.5 * x**3 - 2 * x**2 + x + 3

    # Generate training data (sparse, noisy)
    x_train = np.sort(np.random.uniform(-2, 4, n_train))
    y_train = true_func(x_train) + np.random.normal(0, noise_std, n_train)

    # Test data (dense, same noise distribution)
    x_test = np.linspace(-2, 4, n_test)
    y_test = true_func(x_test) + np.random.normal(0, noise_std, n_test)

    # Dense grid for plotting smooth curves
    x_plot = np.linspace(-2.5, 4.5, 300)
    y_true_plot = true_func(x_plot)

    print("Overfitting Demonstration")
    print("=" * 60)
    print(f"Training samples: {n_train}, Noise std: {noise_std}")
    print(f"True function: 0.5x³ - 2x² + x + 3 (cubic)")
    print("-" * 60)
    print(f"{'Degree':>8} | {'Train MSE':>12} | {'Test MSE':>12} | {'Gap':>12}")
    print("-" * 60)

    results = []
    for d in degrees:
        # Standardize
        x_mean, x_std = np.mean(x_train), np.std(x_train)
        z_train = (x_train - x_mean) / x_std
        z_test = (x_test - x_mean) / x_std
        z_plot = (x_plot - x_mean) / x_std

        # Fit
        Phi_train = np.column_stack([z_train**k for k in range(d + 1)])
        Phi_test = np.column_stack([z_test**k for k in range(d + 1)])
        Phi_plot = np.column_stack([z_plot**k for k in range(d + 1)])

        beta = np.linalg.lstsq(Phi_train, y_train, rcond=None)[0]

        # Predictions
        y_pred_train = Phi_train @ beta
        y_pred_test = Phi_test @ beta
        y_pred_plot = Phi_plot @ beta

        # Errors
        train_mse = np.mean((y_train - y_pred_train)**2)
        test_mse = np.mean((y_test - y_pred_test)**2)
        gap = test_mse - train_mse

        status = ""
        if gap > 5 * train_mse:
            status = " [OVERFIT!]"
        elif train_mse > 2:
            status = " [UNDERFIT]"

        print(f"{d:>8} | {train_mse:>12.4f} | {test_mse:>12.4f} | "
              f"{gap:>12.4f}{status}")

        results.append({
            'degree': d,
            'train_mse': train_mse,
            'test_mse': test_mse,
            'y_pred_plot': y_pred_plot,
            'coefficients': beta
        })

    return results

# Run demonstration
results = demonstrate_overfitting(n_train=15, degrees=[1, 3, 5, 12])

# Show coefficient magnitudes for overfitting case
print("Coefficient Analysis (degree 12 - overfitting):")
print("-" * 40)
high_degree_result = [r for r in results if r['degree'] == 12][0]
coeffs = high_degree_result['coefficients']
print(f"Max |coefficient|: {np.max(np.abs(coeffs)):.2f}")
print(f"Sum |coefficients|: {np.sum(np.abs(coeffs)):.2f}")
print("→ Large coefficients signal overfitting!")
```

Notice that the degree-12 polynomial achieves near-zero training error but terrible test error. Its coefficients are enormous relative to the data scale, causing wild oscillations between data points. The degree-3 model, matching the true function's degree, achieves the best test error—low bias without excessive variance.
What Are Learning Curves?
Learning curves plot model performance (typically MSE) against training set size. They reveal whether a model is suffering from high bias (underfitting) or high variance (overfitting).
Interpretation:
High Bias (Underfitting): Both training and validation errors are high and converge quickly. More data won't help much—the model lacks capacity.
High Variance (Overfitting): Training error is low, validation error is high, with a large gap. More data would help—the model has excess capacity for the available data.
Good Fit: Both errors are low, gap is small, and they converge with more data.
```python
import numpy as np
from typing import List, Tuple

def compute_learning_curves(x: np.ndarray, y: np.ndarray,
                            degree: int,
                            train_sizes: List[float] = None,
                            n_splits: int = 5,
                            random_state: int = 42) -> dict:
    """
    Compute learning curves for polynomial regression.

    Parameters:
        x, y: Full dataset
        degree: Polynomial degree to evaluate
        train_sizes: Fractions of data to use for training (0-1)
        n_splits: Number of random splits per size

    Returns:
        Dictionary with train sizes, train errors, and validation errors
    """
    if train_sizes is None:
        train_sizes = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]

    np.random.seed(random_state)
    n = len(x)

    results = {
        'train_sizes': [],
        'train_errors_mean': [],
        'train_errors_std': [],
        'val_errors_mean': [],
        'val_errors_std': []
    }

    for frac in train_sizes:
        n_train = max(degree + 2, int(n * frac))  # Need at least degree+2 samples

        train_errors = []
        val_errors = []

        for split in range(n_splits):
            # Random split
            indices = np.random.permutation(n)
            train_idx = indices[:n_train]
            val_idx = indices[n_train:]

            if len(val_idx) < 5:
                continue  # Not enough validation data

            x_train, y_train = x[train_idx], y[train_idx]
            x_val, y_val = x[val_idx], y[val_idx]

            # Standardize based on training data
            x_mean, x_std = np.mean(x_train), np.std(x_train)
            if x_std < 1e-10:
                x_std = 1.0
            z_train = (x_train - x_mean) / x_std
            z_val = (x_val - x_mean) / x_std

            # Fit polynomial
            Phi_train = np.column_stack([z_train**k for k in range(degree + 1)])
            Phi_val = np.column_stack([z_val**k for k in range(degree + 1)])

            try:
                beta = np.linalg.lstsq(Phi_train, y_train, rcond=None)[0]
                y_pred_train = Phi_train @ beta
                y_pred_val = Phi_val @ beta
                train_errors.append(np.mean((y_train - y_pred_train)**2))
                val_errors.append(np.mean((y_val - y_pred_val)**2))
            except np.linalg.LinAlgError:
                pass

        if train_errors:
            results['train_sizes'].append(n_train)
            results['train_errors_mean'].append(np.mean(train_errors))
            results['train_errors_std'].append(np.std(train_errors))
            results['val_errors_mean'].append(np.mean(val_errors))
            results['val_errors_std'].append(np.std(val_errors))

    return results

def diagnose_from_learning_curve(lc: dict, degree: int) -> str:
    """Diagnose underfitting vs overfitting from learning curve."""
    train_final = lc['train_errors_mean'][-1]
    val_final = lc['val_errors_mean'][-1]
    gap = val_final - train_final

    print(f"Learning Curve Diagnosis (degree {degree}):")
    print("-" * 50)
    print(f"Final training error:   {train_final:.4f}")
    print(f"Final validation error: {val_final:.4f}")
    print(f"Generalization gap:     {gap:.4f}")

    # Heuristic diagnosis
    if train_final > 2.0:  # High training error
        diagnosis = "HIGH BIAS (Underfitting)"
        suggestion = "Increase polynomial degree"
    elif gap > 2 * train_final and gap > 0.5:  # Large gap
        diagnosis = "HIGH VARIANCE (Overfitting)"
        suggestion = "Decrease degree, add regularization, or get more data"
    else:
        diagnosis = "GOOD FIT"
        suggestion = "Model appears well-calibrated"

    print(f"Diagnosis: {diagnosis}")
    print(f"Suggestion: {suggestion}")
    return diagnosis

# Example: Compare high and low degree polynomials
np.random.seed(42)
x = np.linspace(-3, 3, 100)
y_true = np.sin(x) + 0.5 * x
y = y_true + np.random.normal(0, 0.5, len(x))

print("Learning Curve Analysis")
print("=" * 60)

for degree in [1, 4, 15]:
    lc = compute_learning_curves(x, y, degree=degree)
    diagnose_from_learning_curve(lc, degree)
```

Reading the curves:

- Converging curves = good model capacity for the data amount
- Parallel high curves = underfitting (increase capacity)
- Diverging curves = overfitting (decrease capacity or get more data)
The gap between curves at maximum training size is the key diagnostic—small gaps indicate good generalization.
The Key Insight:
Instead of reducing polynomial degree (limiting model capacity), we can keep a high-degree polynomial but penalize large coefficients. This shrinks coefficients toward zero, reducing model complexity without removing terms entirely.
Ridge Regression (L2 Regularization):
$$\hat{\boldsymbol{\beta}}_{\text{ridge}} = \arg\min_{\boldsymbol{\beta}} \left[ \sum_{i=1}^{n}(y_i - \boldsymbol{\phi}(x_i)^T \boldsymbol{\beta})^2 + \lambda \sum_{j=1}^{d} \beta_j^2 \right]$$
The penalty $\lambda \sum \beta_j^2$ shrinks all coefficients toward zero. Higher $\lambda$ = more regularization = simpler effective model.
Closed-Form Solution:
$$\hat{\boldsymbol{\beta}}_{\text{ridge}} = (\mathbf{\Phi}^T \mathbf{\Phi} + \lambda \mathbf{I})^{-1} \mathbf{\Phi}^T \mathbf{y}$$
```python
import numpy as np
from typing import Tuple, List

def polynomial_ridge_regression(x: np.ndarray, y: np.ndarray,
                                degree: int,
                                lambda_reg: float,
                                standardize: bool = True) -> Tuple[np.ndarray, dict]:
    """
    Fit polynomial regression with L2 (Ridge) regularization.

    Parameters:
        x, y: Training data
        degree: Polynomial degree
        lambda_reg: Regularization strength
        standardize: Whether to standardize features

    Returns:
        beta: Fitted coefficients
        info: Dictionary with transformation parameters
    """
    n = len(x)

    # Standardize
    if standardize:
        x_mean, x_std = np.mean(x), np.std(x)
        if x_std < 1e-10:
            x_std = 1.0
        z = (x - x_mean) / x_std
    else:
        x_mean, x_std = 0, 1
        z = x

    # Build design matrix
    Phi = np.column_stack([z**k for k in range(degree + 1)])

    # Ridge solution: beta = (Phi'Phi + lambda*I)^{-1} Phi' y
    # Don't regularize the intercept (first column)
    reg_matrix = lambda_reg * np.eye(degree + 1)
    reg_matrix[0, 0] = 0  # No penalty on intercept

    A = Phi.T @ Phi + reg_matrix
    b = Phi.T @ y
    beta = np.linalg.solve(A, b)

    info = {'x_mean': x_mean, 'x_std': x_std, 'lambda': lambda_reg}
    return beta, info

def compare_regularization_strengths(x_train: np.ndarray, y_train: np.ndarray,
                                     x_test: np.ndarray, y_test: np.ndarray,
                                     degree: int,
                                     lambdas: List[float]) -> List[dict]:
    """Compare different regularization strengths."""
    results = []

    # Standardize based on training data
    x_mean, x_std = np.mean(x_train), np.std(x_train)
    z_train = (x_train - x_mean) / x_std
    z_test = (x_test - x_mean) / x_std

    Phi_train = np.column_stack([z_train**k for k in range(degree + 1)])
    Phi_test = np.column_stack([z_test**k for k in range(degree + 1)])

    for lam in lambdas:
        # Ridge solution
        reg_matrix = lam * np.eye(degree + 1)
        reg_matrix[0, 0] = 0
        A = Phi_train.T @ Phi_train + reg_matrix
        b = Phi_train.T @ y_train
        beta = np.linalg.solve(A, b)

        # Predictions
        y_pred_train = Phi_train @ beta
        y_pred_test = Phi_test @ beta

        # Errors
        train_mse = np.mean((y_train - y_pred_train)**2)
        test_mse = np.mean((y_test - y_pred_test)**2)

        # Coefficient norm (measure of complexity)
        coef_norm = np.sqrt(np.sum(beta[1:]**2))  # Exclude intercept

        results.append({
            'lambda': lam,
            'train_mse': train_mse,
            'test_mse': test_mse,
            'coef_norm': coef_norm,
            'gap': test_mse - train_mse
        })

    return results

# Demonstration: High-degree polynomial with regularization
np.random.seed(42)
x_train = np.sort(np.random.uniform(-3, 3, 20))
y_true_train = np.sin(x_train) + 0.5 * x_train
y_train = y_true_train + np.random.normal(0, 0.5, len(x_train))

x_test = np.linspace(-3, 3, 100)
y_true_test = np.sin(x_test) + 0.5 * x_test
y_test = y_true_test + np.random.normal(0, 0.5, len(x_test))

degree = 15  # High degree - would overfit without regularization
lambdas = [0, 0.001, 0.01, 0.1, 1, 10, 100]

results = compare_regularization_strengths(
    x_train, y_train, x_test, y_test, degree, lambdas)

print(f"Regularization Effect on Degree-{degree} Polynomial")
print("=" * 70)
print(f"{'Lambda':>10} | {'Train MSE':>10} | {'Test MSE':>10} | "
      f"{'Gap':>10} | {'||β||':>10}")
print("-" * 70)

for r in results:
    status = ""
    if r['lambda'] == 0:
        status = " ← No reg (overfit)"
    elif r['test_mse'] == min(res['test_mse'] for res in results):
        status = " ← OPTIMAL"
    print(f"{r['lambda']:>10.3f} | {r['train_mse']:>10.4f} | {r['test_mse']:>10.4f} | "
          f"{r['gap']:>10.4f} | {r['coef_norm']:>10.2f}{status}")
```

Notice how λ=0 (unregularized) achieves low training error but high test error—classic overfitting. As λ increases, test error first decreases (reducing overfitting) then increases (underfitting). The optimal λ balances these effects. Cross-validation is typically used to find this sweet spot.
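That cross-validation search can be sketched in a few lines. The helpers below are illustrative (the names `ridge_fit` and `cv_select_lambda` are not from any library): each candidate λ is scored by its average K-fold validation MSE, and the minimizer is selected:

```python
import numpy as np

def ridge_fit(Phi, y, lam):
    """Closed-form ridge solve; the intercept column is left unpenalized."""
    reg = lam * np.eye(Phi.shape[1])
    reg[0, 0] = 0.0
    return np.linalg.solve(Phi.T @ Phi + reg, Phi.T @ y)

def cv_select_lambda(x, y, degree, lambdas, k=5, seed=0):
    """Score each candidate lambda by K-fold validation MSE; return the minimizer."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(x)), k)
    cv_mse = []
    for lam in lambdas:
        errs = []
        for i in range(k):
            val = folds[i]
            trn = np.concatenate([folds[j] for j in range(k) if j != i])
            # Standardize using training-fold statistics only
            mu, sd = x[trn].mean(), x[trn].std()
            z_trn, z_val = (x[trn] - mu) / sd, (x[val] - mu) / sd
            P_trn = np.column_stack([z_trn**p for p in range(degree + 1)])
            P_val = np.column_stack([z_val**p for p in range(degree + 1)])
            beta = ridge_fit(P_trn, y[trn], lam)
            errs.append(np.mean((y[val] - P_val @ beta)**2))
        cv_mse.append(np.mean(errs))
    return lambdas[int(np.argmin(cv_mse))], cv_mse

rng = np.random.default_rng(42)
x = np.sort(rng.uniform(-3, 3, 60))
y = np.sin(x) + 0.5 * x + rng.normal(0, 0.5, len(x))

lambdas = [1e-3, 1e-2, 1e-1, 1, 10]
best_lam, cv_mse = cv_select_lambda(x, y, degree=15, lambdas=lambdas)
print(f"Best lambda by 5-fold CV: {best_lam}")
```

Note that each fold is standardized with training-fold statistics only, so no validation information leaks into the fit.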
Regularization vs. Degree Selection: Different Philosophies
| Approach | Philosophy | Advantages | Disadvantages |
|---|---|---|---|
| Degree Selection | Use minimal terms needed | Simpler models, interpretable | May miss subtle patterns |
| Regularization | Include many terms, shrink coefficients | Automatic complexity control | Less interpretable, needs tuning |
| Combined | Moderate degree + regularization | Best of both worlds | Two hyperparameters to tune |
In practice, the combined approach often works best: use a moderate degree (3-5) with light regularization as insurance against overfitting.
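A minimal sketch of the combined approach (the degree and λ values here are illustrative choices, not prescriptions): a degree-5 fit with a light ridge penalty on the same kind of noisy sinusoidal data used earlier.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(-3, 3, 40))
y = np.sin(x) + 0.5 * x + rng.normal(0, 0.5, len(x))

# Moderate degree plus a light ridge penalty as insurance against overfitting
degree, lam = 5, 0.01
z = (x - x.mean()) / x.std()
Phi = np.column_stack([z**k for k in range(degree + 1)])
reg = lam * np.eye(degree + 1)
reg[0, 0] = 0.0  # leave the intercept unpenalized
beta = np.linalg.solve(Phi.T @ Phi + reg, Phi.T @ y)

train_mse = np.mean((y - Phi @ beta)**2)
print(f"Train MSE: {train_mse:.3f}")
print(f"Max |coefficient|: {np.max(np.abs(beta)):.2f}")
```

With moderate degree the penalty barely changes the fit when the data support it, but it caps coefficient growth if the degree turns out to be more than the data can justify.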
A Surprise from Modern ML Theory:
Classical wisdom says test error follows a U-shaped curve as model complexity increases: decreasing (less bias), reaching a minimum, then increasing (more variance). But modern research has revealed a surprising phenomenon: double descent.
The Double Descent Curve:
As degree increases, test error first falls, then spikes sharply near the interpolation threshold (where the number of parameters $d+1$ matches the number of training points $n$), then falls again in the overparameterized regime where $d \gg n$.
This means extremely high-degree polynomials can generalize well, contradicting classical intuition.
```python
import numpy as np
from typing import List

def double_descent_experiment(n_samples: int = 50,
                              noise_std: float = 0.5,
                              degrees: List[int] = None,
                              n_trials: int = 20) -> List[dict]:
    """
    Demonstrate double descent in polynomial regression.

    The phenomenon appears when degree approaches and exceeds n_samples.
    """
    if degrees is None:
        # Key: go past the interpolation threshold (n_samples)
        degrees = list(range(1, min(n_samples + 30, 100), 2))

    true_func = lambda x: np.sin(2 * np.pi * x) + 0.5 * x

    results = {d: {'train_errors': [], 'test_errors': []} for d in degrees}

    for trial in range(n_trials):
        np.random.seed(trial)

        # Training data
        x_train = np.random.uniform(0, 1, n_samples)
        y_train = true_func(x_train) + np.random.normal(0, noise_std, n_samples)

        # Test data (larger, same distribution)
        x_test = np.random.uniform(0, 1, 200)
        y_test = true_func(x_test) + np.random.normal(0, noise_std, 200)

        for d in degrees:
            # Use minimum-norm solution for overparameterized case
            Phi_train = np.column_stack([x_train**k for k in range(d + 1)])
            Phi_test = np.column_stack([x_test**k for k in range(d + 1)])

            try:
                # lstsq gives the minimum-norm solution when d >= n
                beta, residuals, rank, s = np.linalg.lstsq(
                    Phi_train, y_train, rcond=1e-10
                )

                y_pred_train = Phi_train @ beta
                y_pred_test = Phi_test @ beta

                # Clip extreme predictions for numerical stability
                y_pred_test = np.clip(y_pred_test, -100, 100)

                train_mse = np.mean((y_train - y_pred_train)**2)
                test_mse = np.mean((y_test - y_pred_test)**2)

                if not np.isnan(test_mse) and test_mse < 1000:
                    results[d]['train_errors'].append(train_mse)
                    results[d]['test_errors'].append(test_mse)
            except np.linalg.LinAlgError:
                pass

    # Aggregate
    summary = []
    for d in degrees:
        if results[d]['test_errors']:
            summary.append({
                'degree': d,
                'train_mse': np.mean(results[d]['train_errors']),
                'test_mse': np.median(results[d]['test_errors']),  # Median for robustness
                'test_std': np.std(results[d]['test_errors'])
            })
    return summary

# Run experiment
n_samples = 30
summary = double_descent_experiment(n_samples=n_samples, noise_std=0.3)

print(f"Double Descent Experiment (n = {n_samples})")
print("=" * 60)
print(f"{'Degree':>8} | {'Train MSE':>10} | {'Test MSE':>10} | {'Regime':>20}")
print("-" * 60)

interpolation_threshold = n_samples - 1

for s in summary:
    d = s['degree']
    if d < interpolation_threshold * 0.6:
        regime = "Underparameterized"
    elif d < interpolation_threshold * 1.2:
        regime = "INTERPOLATION PEAK"
    else:
        regime = "Overparameterized"
    print(f"{d:>8} | {s['train_mse']:>10.4f} | {s['test_mse']:>10.4f} | {regime:>20}")

print("Note: Test error spikes near interpolation threshold (d ≈ n-1)")
print("      then decreases in overparameterized regime (d >> n)")
```

Why does this happen? In the overparameterized regime, the minimum-norm solution (from lstsq) implicitly regularizes—it finds the simplest (smallest norm) polynomial that interpolates the data. This implicit regularization explains the improving generalization.
Practical implication: The classical 'sweet spot' at moderate complexity isn't the only option. With appropriate (implicit or explicit) regularization, very flexible models can generalize well.
Based on our analysis, here are comprehensive strategies for avoiding overfitting in polynomial regression:
Signs of a healthy fit:

- Training and CV error both low
- Small gap between training and CV error
- Reasonable predictions at boundaries
- Coefficient magnitudes proportional to data scale
- Stable coefficients across CV folds
Red flags for overfitting:

- Near-zero training error (R² ≈ 1.000)
- Much higher CV/test error
- Coefficients with magnitude >> 1000
- Wild oscillations in the fitted curve
- Predictions explode outside the training range
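These red flags can be folded into a quick programmatic check. The helper below is a hypothetical sketch (the function name and thresholds are illustrative heuristics, not from any library):

```python
import numpy as np

def overfitting_red_flags(train_mse, cv_mse, coefficients):
    """Collect warning signs from the checklist; thresholds are illustrative heuristics."""
    flags = []
    if train_mse < 1e-8:
        flags.append("near-zero training error (suspiciously perfect fit)")
    if cv_mse > 5 * max(train_mse, 1e-12):
        flags.append("large train/CV generalization gap")
    if np.max(np.abs(coefficients)) > 1000:
        flags.append("very large coefficient magnitudes")
    return flags

# A clearly overfit model trips all three checks...
print(overfitting_red_flags(1e-12, 4.7, np.array([2.0, -8.1e5, 3.3e6])))
# ...while a healthy one trips none
print(overfitting_red_flags(0.21, 0.26, np.array([3.0, -1.2, 0.4])))
```

In practice the thresholds should be tuned to the scale of your data; the value of the check is that it runs automatically on every fit.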
We've provided a comprehensive treatment of overfitting in polynomial regression: why polynomials overfit, how the train/test gap and learning curves expose it, how ridge regularization mitigates it, and how double descent complicates the classical complexity story.
What's Next:
We've seen that polynomial regression with monomials ($1, x, x^2, \ldots$) suffers from numerical instability. The next page introduces orthogonal polynomials—specially designed polynomial bases (Legendre, Chebyshev, Hermite) that are numerically stable, interpretable, and provide better-conditioned regression problems.
You now understand overfitting in polynomial regression deeply: its causes, manifestations, detection methods, and mitigation strategies. This knowledge applies broadly to any model with adjustable complexity.