Every data scientist has seen it: a polynomial that fits the training data perfectly—passing through every single point—yet fails catastrophically on new data. This is overfitting, and polynomial regression is particularly susceptible to it.
The power that allows polynomials to capture complex patterns also enables them to memorize noise. A degree-$n$ polynomial can perfectly interpolate $n+1$ points, regardless of whether those points represent signal or noise. This flexibility is both a blessing and a curse.
Understanding, detecting, and mitigating overfitting is essential for any practitioner using polynomial models. This page provides a comprehensive treatment of the overfitting problem in polynomial regression.
By the end of this page, you will understand: (1) Why polynomial regression is prone to overfitting; (2) Mathematical characterization of overfitting through generalization error; (3) Practical detection methods including learning curves and residual analysis; (4) Regularization as a powerful mitigation strategy; (5) The double descent phenomenon in high-degree polynomials.
What Is Overfitting?
Overfitting occurs when a model learns patterns in the training data that don't generalize to new data. Instead of learning the underlying relationship $f(x)$, the model memorizes the specific training examples, including their noise.
The Mathematical Signature:
Let $\text{Err}_{\text{train}}$ be training error and $\text{Err}_{\text{test}}$ be test error (on unseen data). Overfitting manifests as: $$\text{Err}_{\text{train}} \ll \text{Err}_{\text{test}}$$
The gap between training and test error is the generalization gap—a direct measure of overfitting severity.
| Indicator | Underfitting | Good Fit | Overfitting |
|---|---|---|---|
| Training Error | High | Low-moderate | Very low (near zero) |
| Test Error | High | Low-moderate | High |
| Generalization Gap | Small | Small | Large |
| Coefficient Magnitudes | Small | Moderate | Very large |
| Prediction Variability | Low | Moderate | High |
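The indicators in the table can be reproduced in a tiny synthetic experiment. The sketch below (illustrative data and degree choices, not from the later demonstration) interpolates 10 noisy points with a degree-9 polynomial and measures the resulting train/test gap:

```python
import numpy as np

rng = np.random.default_rng(0)

# 10 noisy training points, 200 noisy test points from the same process
x_train = np.linspace(0, 1, 10)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.3, 10)
x_test = np.linspace(0, 1, 200)
y_test = np.sin(2 * np.pi * x_test) + rng.normal(0, 0.3, 200)

# A degree-9 polynomial interpolates all 10 training points exactly
coeffs = np.polyfit(x_train, y_train, 9)
err_train = np.mean((y_train - np.polyval(coeffs, x_train))**2)
err_test = np.mean((y_test - np.polyval(coeffs, x_test))**2)

print(f"train MSE: {err_train:.2e}")
print(f"test MSE:  {err_test:.2e}")
print(f"gap:       {err_test - err_train:.2e}")
```

The training error is essentially zero while the test error cannot fall below the noise floor, so the generalization gap is enormous: exactly the "very low train, high test" signature from the table.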
Why Polynomials Are Particularly Susceptible:
High capacity growth: Adding one degree adds one parameter, but expands representational power dramatically
Runge's phenomenon: Even for smooth functions, high-degree polynomial interpolation can oscillate wildly between data points
Leveraging outliers: High-degree terms amplify the influence of extreme points
Boundary effects: Polynomials can diverge rapidly outside the data range, creating unrealistic predictions
Numerical instability: The Vandermonde matrix becomes ill-conditioned, making coefficients sensitive to small data changes
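Runge's phenomenon is easy to reproduce. The sketch below (an illustrative example using the classic Runge function $1/(1+25x^2)$) interpolates at 11 equispaced nodes: the polynomial passes through every node exactly, yet oscillates far from the true curve near the boundaries:

```python
import numpy as np

# Classic Runge function on [-1, 1]
f = lambda x: 1.0 / (1.0 + 25.0 * x**2)

# Degree-10 polynomial through 11 equispaced nodes (exact interpolation)
x_nodes = np.linspace(-1, 1, 11)
coeffs = np.polyfit(x_nodes, f(x_nodes), 10)

# Error is ~0 at the nodes, but blows up between nodes near the edges
x_dense = np.linspace(-1, 1, 1001)
node_err = np.max(np.abs(np.polyval(coeffs, x_nodes) - f(x_nodes)))
max_err = np.max(np.abs(np.polyval(coeffs, x_dense) - f(x_dense)))

print(f"max error at nodes:   {node_err:.2e}")
print(f"max error on [-1, 1]: {max_err:.2f}")
```

Even though the target function is perfectly smooth and noise-free, the between-node error exceeds the function's entire range: high-degree interpolation can fail badly even without noise.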
A degree $(n-1)$ polynomial can perfectly interpolate $n$ points, achieving zero training error. But this is a trap! Perfect training fit usually means the model is fitting noise, not signal. If your polynomial achieves R² = 1.000, be very suspicious.
The most effective way to understand overfitting is to see it. Let's generate synthetic data and fit polynomials of increasing degree.
```python
import numpy as np
import matplotlib.pyplot as plt

def demonstrate_overfitting(n_train: int = 15,
                            n_test: int = 100,
                            noise_std: float = 1.0,
                            degrees: list = [1, 3, 5, 12]):
    """
    Demonstrate overfitting with polynomials of increasing degree.
    """
    np.random.seed(42)

    # True underlying function (cubic)
    true_func = lambda x: 0.5 * x**3 - 2 * x**2 + x + 3

    # Generate training data (sparse, noisy)
    x_train = np.sort(np.random.uniform(-2, 4, n_train))
    y_train = true_func(x_train) + np.random.normal(0, noise_std, n_train)

    # Test data (dense, same noise distribution)
    x_test = np.linspace(-2, 4, n_test)
    y_test = true_func(x_test) + np.random.normal(0, noise_std, n_test)

    # Dense grid for plotting smooth curves
    x_plot = np.linspace(-2.5, 4.5, 300)
    y_true_plot = true_func(x_plot)

    print("Overfitting Demonstration")
    print("=" * 60)
    print(f"Training samples: {n_train}, Noise std: {noise_std}")
    print(f"True function: 0.5x³ - 2x² + x + 3 (cubic)")
    print("-" * 60)
    print(f"{'Degree':>8} | {'Train MSE':>12} | {'Test MSE':>12} | {'Gap':>12}")
    print("-" * 60)

    results = []
    for d in degrees:
        # Standardize
        x_mean, x_std = np.mean(x_train), np.std(x_train)
        z_train = (x_train - x_mean) / x_std
        z_test = (x_test - x_mean) / x_std
        z_plot = (x_plot - x_mean) / x_std

        # Fit
        Phi_train = np.column_stack([z_train**k for k in range(d + 1)])
        Phi_test = np.column_stack([z_test**k for k in range(d + 1)])
        Phi_plot = np.column_stack([z_plot**k for k in range(d + 1)])

        beta = np.linalg.lstsq(Phi_train, y_train, rcond=None)[0]

        # Predictions
        y_pred_train = Phi_train @ beta
        y_pred_test = Phi_test @ beta
        y_pred_plot = Phi_plot @ beta

        # Errors
        train_mse = np.mean((y_train - y_pred_train)**2)
        test_mse = np.mean((y_test - y_pred_test)**2)
        gap = test_mse - train_mse

        status = ""
        if gap > 5 * train_mse:
            status = " [OVERFIT!]"
        elif train_mse > 2:
            status = " [UNDERFIT]"

        print(f"{d:>8} | {train_mse:>12.4f} | {test_mse:>12.4f} | "
              f"{gap:>12.4f}{status}")

        results.append({
            'degree': d,
            'train_mse': train_mse,
            'test_mse': test_mse,
            'y_pred_plot': y_pred_plot,
            'coefficients': beta
        })

    return results

# Run demonstration
results = demonstrate_overfitting(n_train=15, degrees=[1, 3, 5, 12])

# Show coefficient magnitudes for overfitting case
print("Coefficient Analysis (degree 12 - overfitting):")
print("-" * 40)
high_degree_result = [r for r in results if r['degree'] == 12][0]
coeffs = high_degree_result['coefficients']
print(f"Max |coefficient|: {np.max(np.abs(coeffs)):.2f}")
print(f"Sum |coefficients|: {np.sum(np.abs(coeffs)):.2f}")
print("→ Large coefficients signal overfitting!")
```

Notice that the degree-12 polynomial achieves near-zero training error but terrible test error. Its coefficients are enormous relative to the data scale, causing wild oscillations between data points. The degree-3 model, matching the true function's degree, achieves the best test error—low bias without excessive variance.
What Are Learning Curves?
Learning curves plot model performance (typically MSE) against training set size. They reveal whether a model is suffering from high bias (underfitting) or high variance (overfitting).
Interpretation:
High Bias (Underfitting): Both training and validation errors are high and converge quickly. More data won't help much—the model lacks capacity.
High Variance (Overfitting): Training error is low, validation error is high, with a large gap. More data would help—the model has excess capacity for the available data.
Good Fit: Both errors are low, gap is small, and they converge with more data.
```python
import numpy as np
from typing import List, Tuple

def compute_learning_curves(x: np.ndarray, y: np.ndarray,
                            degree: int,
                            train_sizes: List[float] = None,
                            n_splits: int = 5,
                            random_state: int = 42) -> dict:
    """
    Compute learning curves for polynomial regression.

    Parameters:
        x, y: Full dataset
        degree: Polynomial degree to evaluate
        train_sizes: Fractions of data to use for training (0-1)
        n_splits: Number of random splits per size

    Returns:
        Dictionary with train sizes, train errors, and validation errors
    """
    if train_sizes is None:
        train_sizes = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]

    np.random.seed(random_state)
    n = len(x)

    results = {
        'train_sizes': [],
        'train_errors_mean': [],
        'train_errors_std': [],
        'val_errors_mean': [],
        'val_errors_std': []
    }

    for frac in train_sizes:
        n_train = max(degree + 2, int(n * frac))  # Need at least degree+2 samples

        train_errors = []
        val_errors = []

        for split in range(n_splits):
            # Random split
            indices = np.random.permutation(n)
            train_idx = indices[:n_train]
            val_idx = indices[n_train:]

            if len(val_idx) < 5:
                continue  # Not enough validation data

            x_train, y_train = x[train_idx], y[train_idx]
            x_val, y_val = x[val_idx], y[val_idx]

            # Standardize based on training data
            x_mean, x_std = np.mean(x_train), np.std(x_train)
            if x_std < 1e-10:
                x_std = 1.0
            z_train = (x_train - x_mean) / x_std
            z_val = (x_val - x_mean) / x_std

            # Fit polynomial
            Phi_train = np.column_stack([z_train**k for k in range(degree + 1)])
            Phi_val = np.column_stack([z_val**k for k in range(degree + 1)])

            try:
                beta = np.linalg.lstsq(Phi_train, y_train, rcond=None)[0]
                y_pred_train = Phi_train @ beta
                y_pred_val = Phi_val @ beta
                train_errors.append(np.mean((y_train - y_pred_train)**2))
                val_errors.append(np.mean((y_val - y_pred_val)**2))
            except np.linalg.LinAlgError:
                pass

        if train_errors:
            results['train_sizes'].append(n_train)
            results['train_errors_mean'].append(np.mean(train_errors))
            results['train_errors_std'].append(np.std(train_errors))
            results['val_errors_mean'].append(np.mean(val_errors))
            results['val_errors_std'].append(np.std(val_errors))

    return results

def diagnose_from_learning_curve(lc: dict, degree: int) -> str:
    """Diagnose underfitting vs overfitting from learning curve."""
    train_final = lc['train_errors_mean'][-1]
    val_final = lc['val_errors_mean'][-1]
    gap = val_final - train_final

    print(f"Learning Curve Diagnosis (degree {degree}):")
    print("-" * 50)
    print(f"Final training error:   {train_final:.4f}")
    print(f"Final validation error: {val_final:.4f}")
    print(f"Generalization gap:     {gap:.4f}")

    # Heuristic diagnosis
    if train_final > 2.0:  # High training error
        diagnosis = "HIGH BIAS (Underfitting)"
        suggestion = "Increase polynomial degree"
    elif gap > 2 * train_final and gap > 0.5:  # Large gap
        diagnosis = "HIGH VARIANCE (Overfitting)"
        suggestion = "Decrease degree, add regularization, or get more data"
    else:
        diagnosis = "GOOD FIT"
        suggestion = "Model appears well-calibrated"

    print(f"Diagnosis: {diagnosis}")
    print(f"Suggestion: {suggestion}")
    return diagnosis

# Example: Compare high and low degree polynomials
np.random.seed(42)
x = np.linspace(-3, 3, 100)
y_true = np.sin(x) + 0.5 * x
y = y_true + np.random.normal(0, 0.5, len(x))

print("Learning Curve Analysis")
print("=" * 60)

for degree in [1, 4, 15]:
    lc = compute_learning_curves(x, y, degree=degree)
    diagnose_from_learning_curve(lc, degree)
```

Reading the curves:

- Converging curves = good model capacity for the data amount
- Parallel high curves = underfitting (increase capacity)
- Diverging curves = overfitting (decrease capacity or get more data)
The gap between curves at maximum training size is the key diagnostic—small gaps indicate good generalization.
The Key Insight:
Instead of reducing polynomial degree (limiting model capacity), we can keep a high-degree polynomial but penalize large coefficients. This shrinks coefficients toward zero, reducing model complexity without removing terms entirely.
Ridge Regression (L2 Regularization):
$$\hat{\boldsymbol{\beta}}_{\text{ridge}} = \arg\min_{\boldsymbol{\beta}} \left[ \sum_{i=1}^{n}(y_i - \boldsymbol{\phi}(x_i)^T \boldsymbol{\beta})^2 + \lambda \sum_{j=1}^{d} \beta_j^2 \right]$$
The penalty $\lambda \sum \beta_j^2$ shrinks all coefficients toward zero. Higher $\lambda$ = more regularization = simpler effective model.
Closed-Form Solution:
$$\hat{\boldsymbol{\beta}}_{\text{ridge}} = (\mathbf{\Phi}^T \mathbf{\Phi} + \lambda \mathbf{I})^{-1} \mathbf{\Phi}^T \mathbf{y}$$
```python
import numpy as np
from typing import Tuple, List

def polynomial_ridge_regression(x: np.ndarray, y: np.ndarray,
                                degree: int,
                                lambda_reg: float,
                                standardize: bool = True) -> Tuple[np.ndarray, dict]:
    """
    Fit polynomial regression with L2 (Ridge) regularization.

    Parameters:
        x, y: Training data
        degree: Polynomial degree
        lambda_reg: Regularization strength
        standardize: Whether to standardize features

    Returns:
        beta: Fitted coefficients
        info: Dictionary with transformation parameters
    """
    n = len(x)

    # Standardize
    if standardize:
        x_mean, x_std = np.mean(x), np.std(x)
        if x_std < 1e-10:
            x_std = 1.0
        z = (x - x_mean) / x_std
    else:
        x_mean, x_std = 0, 1
        z = x

    # Build design matrix
    Phi = np.column_stack([z**k for k in range(degree + 1)])

    # Ridge solution: beta = (Phi'Phi + lambda*I)^{-1} Phi' y
    # Don't regularize the intercept (first column)
    reg_matrix = lambda_reg * np.eye(degree + 1)
    reg_matrix[0, 0] = 0  # No penalty on intercept

    A = Phi.T @ Phi + reg_matrix
    b = Phi.T @ y
    beta = np.linalg.solve(A, b)

    info = {'x_mean': x_mean, 'x_std': x_std, 'lambda': lambda_reg}
    return beta, info

def compare_regularization_strengths(x_train: np.ndarray, y_train: np.ndarray,
                                     x_test: np.ndarray, y_test: np.ndarray,
                                     degree: int,
                                     lambdas: List[float]) -> List[dict]:
    """Compare different regularization strengths."""
    results = []

    # Standardize based on training data
    x_mean, x_std = np.mean(x_train), np.std(x_train)
    z_train = (x_train - x_mean) / x_std
    z_test = (x_test - x_mean) / x_std

    Phi_train = np.column_stack([z_train**k for k in range(degree + 1)])
    Phi_test = np.column_stack([z_test**k for k in range(degree + 1)])

    for lam in lambdas:
        # Ridge solution
        reg_matrix = lam * np.eye(degree + 1)
        reg_matrix[0, 0] = 0
        A = Phi_train.T @ Phi_train + reg_matrix
        b = Phi_train.T @ y_train
        beta = np.linalg.solve(A, b)

        # Predictions
        y_pred_train = Phi_train @ beta
        y_pred_test = Phi_test @ beta

        # Errors
        train_mse = np.mean((y_train - y_pred_train)**2)
        test_mse = np.mean((y_test - y_pred_test)**2)

        # Coefficient norm (measure of complexity)
        coef_norm = np.sqrt(np.sum(beta[1:]**2))  # Exclude intercept

        results.append({
            'lambda': lam,
            'train_mse': train_mse,
            'test_mse': test_mse,
            'coef_norm': coef_norm,
            'gap': test_mse - train_mse
        })

    return results

# Demonstration: High-degree polynomial with regularization
np.random.seed(42)
x_train = np.sort(np.random.uniform(-3, 3, 20))
y_true_train = np.sin(x_train) + 0.5 * x_train
y_train = y_true_train + np.random.normal(0, 0.5, len(x_train))

x_test = np.linspace(-3, 3, 100)
y_true_test = np.sin(x_test) + 0.5 * x_test
y_test = y_true_test + np.random.normal(0, 0.5, len(x_test))

degree = 15  # High degree - would overfit without regularization
lambdas = [0, 0.001, 0.01, 0.1, 1, 10, 100]

results = compare_regularization_strengths(
    x_train, y_train, x_test, y_test, degree, lambdas)

print(f"Regularization Effect on Degree-{degree} Polynomial")
print("=" * 70)
print(f"{'Lambda':>10} | {'Train MSE':>10} | {'Test MSE':>10} | "
      f"{'Gap':>10} | {'||β||':>10}")
print("-" * 70)

for r in results:
    status = ""
    if r['lambda'] == 0:
        status = " ← No reg (overfit)"
    elif r['test_mse'] == min(res['test_mse'] for res in results):
        status = " ← OPTIMAL"
    print(f"{r['lambda']:>10.3f} | {r['train_mse']:>10.4f} | {r['test_mse']:>10.4f} | "
          f"{r['gap']:>10.4f} | {r['coef_norm']:>10.2f}{status}")
```

Notice how λ=0 (unregularized) achieves low training error but high test error—classic overfitting. As λ increases, test error first decreases (reducing overfitting) then increases (underfitting). The optimal λ balances these effects. Cross-validation is typically used to find this sweet spot.
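That cross-validation search can be sketched in a few lines. The helpers below are illustrative (the names `ridge_fit` and `cv_select_lambda` are not from any library): each candidate λ is scored by its average K-fold validation MSE, and the minimizer is selected:

```python
import numpy as np

def ridge_fit(Phi, y, lam):
    """Closed-form ridge solve; the intercept column is left unpenalized."""
    reg = lam * np.eye(Phi.shape[1])
    reg[0, 0] = 0.0
    return np.linalg.solve(Phi.T @ Phi + reg, Phi.T @ y)

def cv_select_lambda(x, y, degree, lambdas, k=5, seed=0):
    """Score each candidate lambda by K-fold validation MSE; return the minimizer."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(x)), k)
    cv_mse = []
    for lam in lambdas:
        errs = []
        for i in range(k):
            val = folds[i]
            trn = np.concatenate([folds[j] for j in range(k) if j != i])
            # Standardize using training-fold statistics only
            mu, sd = x[trn].mean(), x[trn].std()
            z_trn, z_val = (x[trn] - mu) / sd, (x[val] - mu) / sd
            P_trn = np.column_stack([z_trn**p for p in range(degree + 1)])
            P_val = np.column_stack([z_val**p for p in range(degree + 1)])
            beta = ridge_fit(P_trn, y[trn], lam)
            errs.append(np.mean((y[val] - P_val @ beta)**2))
        cv_mse.append(np.mean(errs))
    return lambdas[int(np.argmin(cv_mse))], cv_mse

rng = np.random.default_rng(42)
x = np.sort(rng.uniform(-3, 3, 60))
y = np.sin(x) + 0.5 * x + rng.normal(0, 0.5, len(x))

lambdas = [1e-3, 1e-2, 1e-1, 1, 10]
best_lam, cv_mse = cv_select_lambda(x, y, degree=15, lambdas=lambdas)
print(f"Best lambda by 5-fold CV: {best_lam}")
```

Note that each fold is standardized with training-fold statistics only, so no validation information leaks into the fit.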
Regularization vs. Degree Selection: Different Philosophies
| Approach | Philosophy | Advantages | Disadvantages |
|---|---|---|---|
| Degree Selection | Use minimal terms needed | Simpler models, interpretable | May miss subtle patterns |
| Regularization | Include many terms, shrink coefficients | Automatic complexity control | Less interpretable, needs tuning |
| Combined | Moderate degree + regularization | Best of both worlds | Two hyperparameters to tune |
In practice, the combined approach often works best: use a moderate degree (3-5) with light regularization as insurance against overfitting.
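A minimal sketch of the combined approach (the degree and λ values here are illustrative choices, not prescriptions): a degree-5 fit with a light ridge penalty on the same kind of noisy sinusoidal data used earlier.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(-3, 3, 40))
y = np.sin(x) + 0.5 * x + rng.normal(0, 0.5, len(x))

# Moderate degree plus a light ridge penalty as insurance against overfitting
degree, lam = 5, 0.01
z = (x - x.mean()) / x.std()
Phi = np.column_stack([z**k for k in range(degree + 1)])
reg = lam * np.eye(degree + 1)
reg[0, 0] = 0.0  # leave the intercept unpenalized
beta = np.linalg.solve(Phi.T @ Phi + reg, Phi.T @ y)

train_mse = np.mean((y - Phi @ beta)**2)
print(f"Train MSE: {train_mse:.3f}")
print(f"Max |coefficient|: {np.max(np.abs(beta)):.2f}")
```

With moderate degree the penalty barely changes the fit when the data support it, but it caps coefficient growth if the degree turns out to be more than the data can justify.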
A Surprise from Modern ML Theory:
Classical wisdom says test error follows a U-shaped curve as model complexity increases: decreasing (less bias), reaching a minimum, then increasing (more variance). But modern research has revealed a surprising phenomenon: double descent.
The Double Descent Curve:
As degree increases, test error first falls, then spikes sharply near the interpolation threshold (where the number of parameters $d+1$ matches the number of training points $n$), then falls again in the overparameterized regime where $d \gg n$.
This means extremely high-degree polynomials can generalize well, contradicting classical intuition.
```python
import numpy as np
from typing import List

def double_descent_experiment(n_samples: int = 50,
                              noise_std: float = 0.5,
                              degrees: List[int] = None,
                              n_trials: int = 20) -> List[dict]:
    """
    Demonstrate double descent in polynomial regression.

    The phenomenon appears when degree approaches and exceeds n_samples.
    """
    if degrees is None:
        # Key: go past the interpolation threshold (n_samples)
        degrees = list(range(1, min(n_samples + 30, 100), 2))

    true_func = lambda x: np.sin(2 * np.pi * x) + 0.5 * x

    results = {d: {'train_errors': [], 'test_errors': []} for d in degrees}

    for trial in range(n_trials):
        np.random.seed(trial)

        # Training data
        x_train = np.random.uniform(0, 1, n_samples)
        y_train = true_func(x_train) + np.random.normal(0, noise_std, n_samples)

        # Test data (larger, same distribution)
        x_test = np.random.uniform(0, 1, 200)
        y_test = true_func(x_test) + np.random.normal(0, noise_std, 200)

        for d in degrees:
            # Use minimum-norm solution for overparameterized case
            Phi_train = np.column_stack([x_train**k for k in range(d + 1)])
            Phi_test = np.column_stack([x_test**k for k in range(d + 1)])

            try:
                # lstsq gives the minimum-norm solution when d >= n
                beta, residuals, rank, s = np.linalg.lstsq(
                    Phi_train, y_train, rcond=1e-10
                )

                y_pred_train = Phi_train @ beta
                y_pred_test = Phi_test @ beta

                # Clip extreme predictions for numerical stability
                y_pred_test = np.clip(y_pred_test, -100, 100)

                train_mse = np.mean((y_train - y_pred_train)**2)
                test_mse = np.mean((y_test - y_pred_test)**2)

                if not np.isnan(test_mse) and test_mse < 1000:
                    results[d]['train_errors'].append(train_mse)
                    results[d]['test_errors'].append(test_mse)
            except np.linalg.LinAlgError:
                pass

    # Aggregate
    summary = []
    for d in degrees:
        if results[d]['test_errors']:
            summary.append({
                'degree': d,
                'train_mse': np.mean(results[d]['train_errors']),
                'test_mse': np.median(results[d]['test_errors']),  # Median for robustness
                'test_std': np.std(results[d]['test_errors'])
            })
    return summary

# Run experiment
n_samples = 30
summary = double_descent_experiment(n_samples=n_samples, noise_std=0.3)

print(f"Double Descent Experiment (n = {n_samples})")
print("=" * 60)
print(f"{'Degree':>8} | {'Train MSE':>10} | {'Test MSE':>10} | {'Regime':>20}")
print("-" * 60)

interpolation_threshold = n_samples - 1

for s in summary:
    d = s['degree']
    if d < interpolation_threshold * 0.6:
        regime = "Underparameterized"
    elif d < interpolation_threshold * 1.2:
        regime = "INTERPOLATION PEAK"
    else:
        regime = "Overparameterized"
    print(f"{d:>8} | {s['train_mse']:>10.4f} | {s['test_mse']:>10.4f} | {regime:>20}")

print("Note: Test error spikes near interpolation threshold (d ≈ n-1)")
print("      then decreases in overparameterized regime (d >> n)")
```

Why does this happen? In the overparameterized regime, the minimum-norm solution (from lstsq) implicitly regularizes—it finds the simplest (smallest norm) polynomial that interpolates the data. This implicit regularization explains the improving generalization.
Practical implication: The classical 'sweet spot' at moderate complexity isn't the only option. With appropriate (implicit or explicit) regularization, very flexible models can generalize well.
Based on our analysis, here are comprehensive strategies for avoiding overfitting in polynomial regression:
Signs of a healthy fit:

- Training and CV error both low
- Small gap between training and CV error
- Reasonable predictions at boundaries
- Coefficient magnitudes proportional to data scale
- Stable coefficients across CV folds
Red flags for overfitting:

- Near-zero training error (R² ≈ 1.000)
- Much higher CV/test error
- Coefficients with magnitude >> 1000
- Wild oscillations in the fitted curve
- Predictions explode outside the training range
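These red flags can be folded into a quick programmatic check. The helper below is a hypothetical sketch (the function name and thresholds are illustrative heuristics, not from any library):

```python
import numpy as np

def overfitting_red_flags(train_mse, cv_mse, coefficients):
    """Collect warning signs from the checklist; thresholds are illustrative heuristics."""
    flags = []
    if train_mse < 1e-8:
        flags.append("near-zero training error (suspiciously perfect fit)")
    if cv_mse > 5 * max(train_mse, 1e-12):
        flags.append("large train/CV generalization gap")
    if np.max(np.abs(coefficients)) > 1000:
        flags.append("very large coefficient magnitudes")
    return flags

# A clearly overfit model trips all three checks...
print(overfitting_red_flags(1e-12, 4.7, np.array([2.0, -8.1e5, 3.3e6])))
# ...while a healthy one trips none
print(overfitting_red_flags(0.21, 0.26, np.array([3.0, -1.2, 0.4])))
```

In practice the thresholds should be tuned to the scale of your data; the value of the check is that it runs automatically on every fit.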
We've provided a comprehensive treatment of overfitting in polynomial regression: why polynomials overfit, how the train/test gap and learning curves expose it, how ridge regularization mitigates it, and how double descent complicates the classical complexity story.
What's Next:
We've seen that polynomial regression with monomials ($1, x, x^2, \ldots$) suffers from numerical instability. The next page introduces orthogonal polynomials—specially designed polynomial bases (Legendre, Chebyshev, Hermite) that are numerically stable, interpretable, and provide better-conditioned regression problems.
You now understand overfitting in polynomial regression deeply: its causes, manifestations, detection methods, and mitigation strategies. This knowledge applies broadly to any model with adjustable complexity.