In polynomial regression, the degree $d$ is the most critical hyperparameter. It controls the model's capacity to capture complex patterns—but as with Goldilocks, we need the choice to be 'just right.'
Consider fitting a polynomial to noisy data: too low a degree misses the underlying trend (underfitting), while too high a degree chases the noise (overfitting).
This page develops rigorous, principled methods for finding that optimal degree—moving beyond intuition to systematic model selection.
By the end of this page, you will master: (1) Cross-validation for degree selection with proper implementation; (2) Information criteria (AIC, BIC) and their mathematical foundations; (3) Hypothesis testing for comparing polynomial models; (4) The bias-variance decomposition applied to polynomial degree; (5) Practical heuristics and rules of thumb.
The Expected Prediction Error:
For a new test point $x_0$ with true value $y_0 = f(x_0) + \epsilon$ (where $\epsilon$ is noise with variance $\sigma^2$), the expected squared error can be decomposed:
$$\mathbb{E}[(y_0 - \hat{f}(x_0))^2] = \text{Bias}^2[\hat{f}(x_0)] + \text{Var}[\hat{f}(x_0)] + \sigma^2$$
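This decomposition follows from a short, standard expansion. Writing $\hat{f} = \hat{f}(x_0)$ and $f = f(x_0)$, and using that the noise $\epsilon$ is independent of the fitted model with $\mathbb{E}[\epsilon] = 0$:

$$\begin{aligned}
\mathbb{E}[(y_0 - \hat{f})^2] &= \mathbb{E}[(f + \epsilon - \hat{f})^2] \\
&= \mathbb{E}[(f - \hat{f})^2] + \sigma^2 \\
&= \left(f - \mathbb{E}[\hat{f}]\right)^2 + \mathbb{E}\left[(\hat{f} - \mathbb{E}[\hat{f}])^2\right] + \sigma^2 \\
&= \text{Bias}^2[\hat{f}] + \text{Var}[\hat{f}] + \sigma^2
\end{aligned}$$

The cross term in the last expansion vanishes because $\mathbb{E}\big[(f - \mathbb{E}[\hat{f}])(\mathbb{E}[\hat{f}] - \hat{f})\big] = 0$.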
As polynomial degree $d$ increases:
| Component | Effect of Increasing $d$ |
|---|---|
| Bias² | Decreases (more flexibility to match true $f$) |
| Variance | Increases (more parameters → more sensitivity to training noise) |
| Irreducible error $\sigma^2$ | Unchanged (inherent noise) |
The optimal degree minimizes the sum Bias² + Variance, not either component alone.
```python
import numpy as np
from typing import Tuple, List

def bias_variance_simulation(true_func, x_range: Tuple[float, float],
                             n_train: int, noise_std: float,
                             degrees: List[int], n_simulations: int = 200,
                             n_test: int = 50) -> dict:
    """
    Empirically estimate bias and variance for different polynomial degrees.

    Process:
    1. Generate many training datasets from the same true function + noise
    2. Fit each polynomial degree to each dataset
    3. Evaluate predictions at fixed test points
    4. Compute empirical bias and variance across simulations
    """
    x_min, x_max = x_range

    # Fixed test points for evaluation
    x_test = np.linspace(x_min, x_max, n_test)
    y_true = true_func(x_test)

    results = {d: {'predictions': np.zeros((n_simulations, n_test))}
               for d in degrees}

    for sim in range(n_simulations):
        # Generate training data
        x_train = np.random.uniform(x_min, x_max, n_train)
        y_train = true_func(x_train) + np.random.normal(0, noise_std, n_train)

        # Fit each degree
        for d in degrees:
            # Center and scale
            x_mean, x_std = np.mean(x_train), np.std(x_train)
            if x_std < 1e-10:
                x_std = 1.0
            z_train = (x_train - x_mean) / x_std
            z_test = (x_test - x_mean) / x_std

            # Fit polynomial
            Phi_train = np.column_stack([z_train**k for k in range(d + 1)])
            Phi_test = np.column_stack([z_test**k for k in range(d + 1)])
            try:
                beta = np.linalg.lstsq(Phi_train, y_train, rcond=None)[0]
                y_pred = Phi_test @ beta
            except np.linalg.LinAlgError:
                y_pred = np.full(n_test, np.nan)

            results[d]['predictions'][sim, :] = y_pred

    # Compute bias and variance at each test point
    for d in degrees:
        preds = results[d]['predictions']
        mean_pred = np.nanmean(preds, axis=0)  # E[f_hat(x)]

        # Bias = E[f_hat(x)] - f(x)
        bias = mean_pred - y_true
        bias_squared = np.nanmean(bias**2)  # Average over test points

        # Variance = E[(f_hat(x) - E[f_hat(x)])^2]
        variance = np.nanmean(np.nanvar(preds, axis=0))

        # Total expected error (averaged over simulations and test points)
        mse = np.nanmean((preds - y_true)**2)

        results[d]['bias_squared'] = bias_squared
        results[d]['variance'] = variance
        results[d]['mse'] = mse
        results[d]['irreducible'] = noise_std**2

    return results

# True function: cubic
true_func = lambda x: 0.5 * x**3 - 2 * x**2 + x + 1

# Run simulation
degrees = [1, 2, 3, 4, 5, 7, 10, 15]
results = bias_variance_simulation(
    true_func, x_range=(-2, 3), n_train=30, noise_std=1.0,
    degrees=degrees, n_simulations=200)

# Display results
print("Bias-Variance Decomposition by Polynomial Degree")
print("True function: 0.5x³ - 2x² + x + 1 (cubic)")
print("=" * 70)
print(f"{'Degree':>8} | {'Bias²':>12} | {'Variance':>12} | {'MSE':>12} | {'Optimal?':>10}")
print("-" * 70)

min_mse = min(r['mse'] for r in results.values())
for d in degrees:
    r = results[d]
    is_optimal = " ✓" if abs(r['mse'] - min_mse) < 0.1 else ""
    print(f"{d:>8} | {r['bias_squared']:>12.4f} | {r['variance']:>12.4f} | "
          f"{r['mse']:>12.4f} | {is_optimal:>10}")
```

In the simulation above, the true function is cubic (degree 3). Notice that degrees 1-2 have high bias (they can't capture the cubic shape), degrees 3-4 have optimal MSE (low bias, variance still controlled), and degrees 7+ show exploding variance despite near-zero bias. The optimal degree captures the true complexity without inviting unnecessary variance.
The Core Idea:
Cross-validation (CV) estimates how well a model generalizes to unseen data by systematically holding out portions of the training data for validation. For polynomial degree selection, we fit polynomials of different degrees and choose the one with lowest CV error.
K-Fold Cross-Validation:
The Mathematical Formulation:
Let $\kappa: \{1, \ldots, n\} \to \{1, \ldots, K\}$ be the fold assignment. The CV error for degree $d$ is:
$$\text{CV}_K(d) = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{f}_{-\kappa(i)}^{(d)}(x_i) \right)^2$$
where $\hat{f}_{-\kappa(i)}^{(d)}$ is the degree-$d$ polynomial fitted on all data except fold $\kappa(i)$.
```python
import numpy as np
from typing import List

def k_fold_cv_polynomial(x: np.ndarray, y: np.ndarray, degrees: List[int],
                         k: int = 5, random_state: int = None) -> dict:
    """
    Perform K-fold cross-validation to select optimal polynomial degree.

    Parameters:
        x, y: Training data
        degrees: List of polynomial degrees to evaluate
        k: Number of folds
        random_state: Random seed for reproducibility

    Returns:
        Dictionary with CV errors, standard errors, and best degree
    """
    n = len(x)
    if random_state is not None:
        np.random.seed(random_state)

    # Random fold assignment
    indices = np.random.permutation(n)
    fold_sizes = np.full(k, n // k)
    fold_sizes[:n % k] += 1  # Distribute remainder
    fold_indices = []
    current = 0
    for size in fold_sizes:
        fold_indices.append(indices[current:current + size])
        current += size

    results = {d: {'fold_errors': []} for d in degrees}

    for d in degrees:
        for fold_idx in range(k):
            # Train on all folds except fold_idx
            val_idx = fold_indices[fold_idx]
            train_idx = np.concatenate([fold_indices[i]
                                        for i in range(k) if i != fold_idx])

            x_train, y_train = x[train_idx], y[train_idx]
            x_val, y_val = x[val_idx], y[val_idx]

            # Center and scale based on training data only
            x_mean, x_std = np.mean(x_train), np.std(x_train)
            if x_std < 1e-10:
                x_std = 1.0
            z_train = (x_train - x_mean) / x_std
            z_val = (x_val - x_mean) / x_std

            # Fit polynomial
            Phi_train = np.column_stack([z_train**j for j in range(d + 1)])
            Phi_val = np.column_stack([z_val**j for j in range(d + 1)])
            try:
                beta = np.linalg.lstsq(Phi_train, y_train, rcond=None)[0]
                y_pred = Phi_val @ beta
                mse = np.mean((y_val - y_pred)**2)
            except np.linalg.LinAlgError:
                mse = np.inf

            results[d]['fold_errors'].append(mse)

        # Aggregate across folds
        fold_errors = results[d]['fold_errors']
        results[d]['cv_error'] = np.mean(fold_errors)
        results[d]['cv_std'] = np.std(fold_errors) / np.sqrt(k)  # Std error of mean

    # Find optimal degree
    cv_errors = {d: results[d]['cv_error'] for d in degrees}
    best_degree = min(cv_errors, key=cv_errors.get)

    # One-standard-error rule: simplest model within 1 SE of best
    best_cv_error = cv_errors[best_degree]
    threshold = best_cv_error + results[best_degree]['cv_std']
    one_se_degree = min(d for d in sorted(degrees) if cv_errors[d] <= threshold)

    return {
        'detailed': results,
        'best_degree': best_degree,
        'one_se_degree': one_se_degree,
        'cv_errors': cv_errors
    }

# Example usage
np.random.seed(42)
x = np.linspace(-3, 3, 60)
y_true = 0.3 * x**3 - 0.8 * x**2 + 0.5 * x + 2
y = y_true + np.random.normal(0, 1.5, len(x))

degrees = list(range(1, 12))
cv_results = k_fold_cv_polynomial(x, y, degrees, k=5, random_state=42)

print("K-Fold Cross-Validation Results (K=5)")
print("True function: cubic")
print("=" * 60)
print(f"{'Degree':>8} | {'CV Error':>12} | {'± Std Err':>12} | {'Status':>12}")
print("-" * 60)

for d in degrees:
    r = cv_results['detailed'][d]
    status = ""
    if d == cv_results['best_degree']:
        status = "BEST"
    elif d == cv_results['one_se_degree']:
        status = "1-SE RULE"
    print(f"{d:>8} | {r['cv_error']:>12.4f} | {r['cv_std']:>12.4f} | {status:>12}")

print(f"\nBest degree (minimum CV): {cv_results['best_degree']}")
print(f"One-SE rule degree: {cv_results['one_se_degree']}")
```

The one-standard-error rule is a powerful principle for model selection: choose the simplest model whose CV error is within one standard error of the minimum. This guards against overfitting to the validation sets themselves and typically yields more robust generalization. For polynomials, this often selects a degree 1-2 lower than the pure minimum CV choice.
Leave-One-Out Cross-Validation (LOOCV):
When $K = n$ (each fold contains exactly one observation), we get LOOCV. For polynomial regression specifically, LOOCV has a remarkable shortcut—we don't need to refit $n$ times!
The PRESS Statistic (Predicted Residual Error Sum of Squares):
$$\text{PRESS} = \sum_{i=1}^{n} \left( \frac{e_i}{1 - h_{ii}} \right)^2$$
where $e_i = y_i - \hat{y}_i$ is the residual and $h_{ii}$ is the $i$-th diagonal element of the hat matrix $\mathbf{H} = \mathbf{\Phi}(\mathbf{\Phi}^T \mathbf{\Phi})^{-1}\mathbf{\Phi}^T$.
This formula gives exact LOOCV error from a single model fit!
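As a sketch of the shortcut in action (the helper name `press_loocv` is my own, not from this page's code), the PRESS formula can be checked against brute-force LOOCV, which refits the model $n$ times:

```python
import numpy as np

def press_loocv(x: np.ndarray, y: np.ndarray, d: int) -> float:
    """Exact LOOCV mean squared error for a degree-d polynomial via PRESS."""
    z = (x - np.mean(x)) / np.std(x)  # center and scale, as elsewhere on this page
    Phi = np.column_stack([z**k for k in range(d + 1)])
    beta, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    e = y - Phi @ beta  # ordinary residuals from a single fit
    # Leverages h_ii: diagonal of H = Phi (Phi^T Phi)^{-1} Phi^T
    G = np.linalg.pinv(Phi.T @ Phi)
    h = np.sum((Phi @ G) * Phi, axis=1)
    return float(np.mean((e / (1 - h))**2))  # PRESS / n

# Sanity check against brute-force LOOCV (n separate refits)
rng = np.random.default_rng(0)
x = np.linspace(-2, 2, 25)
y = x**2 + rng.normal(0, 0.5, len(x))

z = (x - np.mean(x)) / np.std(x)  # same fixed design as the shortcut
Phi = np.column_stack([z**k for k in range(3)])
brute = []
for i in range(len(x)):
    mask = np.arange(len(x)) != i
    b, *_ = np.linalg.lstsq(Phi[mask], y[mask], rcond=None)
    brute.append((y[i] - Phi[i] @ b)**2)

print(np.isclose(press_loocv(x, y, 2), np.mean(brute)))  # the identity is exact: True
```

Note that the identity holds for a fixed design matrix, which is why the brute-force loop reuses the standardization computed on the full sample.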
The Core Idea:
Information criteria provide a principled way to balance model fit against model complexity, without the computational cost of cross-validation. They estimate out-of-sample prediction error by adding a penalty for the number of parameters.
Akaike Information Criterion (AIC):
$$\text{AIC} = n \cdot \ln(\text{RSS}/n) + 2(d+2)$$
where $\text{RSS} = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2$ is the residual sum of squares and $(d+2)$ is the number of parameters (including variance estimation).
Bayesian Information Criterion (BIC):
$$\text{BIC} = n \cdot \ln(\text{RSS}/n) + \ln(n) \cdot (d+2)$$
BIC penalizes complexity more heavily (especially for large $n$), favoring simpler models.
| Criterion | Penalty Term | Tendency | Best For |
|---|---|---|---|
| AIC | $2p$ | Slightly favors complex models | Prediction accuracy |
| BIC | $\ln(n) \cdot p$ | Strongly favors simpler models | Finding true model |
| AICc | $\frac{2p(p+1)}{n-p-1}$ correction | Corrects AIC for small samples | Small $n$ settings |
```python
import numpy as np
from typing import List

def compute_information_criteria(x: np.ndarray, y: np.ndarray,
                                 degrees: List[int]) -> dict:
    """
    Compute AIC, AICc, and BIC for polynomial models of various degrees.

    Returns dictionary with all criteria values and selected degrees.
    """
    n = len(y)

    # Center and scale
    x_mean, x_std = np.mean(x), np.std(x)
    if x_std < 1e-10:
        x_std = 1.0
    z = (x - x_mean) / x_std

    results = {}

    for d in degrees:
        # Fit model
        Phi = np.column_stack([z**k for k in range(d + 1)])
        beta = np.linalg.lstsq(Phi, y, rcond=None)[0]
        y_pred = Phi @ beta

        # Residual sum of squares
        rss = np.sum((y - y_pred)**2)

        # Number of parameters (coefficients + variance)
        p = d + 2  # d+1 coefficients + 1 variance parameter

        # Log-likelihood term (proportional)
        ll_term = n * np.log(rss / n)

        # AIC
        aic = ll_term + 2 * p

        # AICc (corrected AIC for small samples)
        if n - p - 1 > 0:
            aicc = aic + (2 * p * (p + 1)) / (n - p - 1)
        else:
            aicc = np.inf

        # BIC
        bic = ll_term + np.log(n) * p

        # R-squared for reference
        tss = np.sum((y - np.mean(y))**2)
        r_squared = 1 - rss / tss
        adj_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - p)

        results[d] = {
            'aic': aic, 'aicc': aicc, 'bic': bic, 'rss': rss,
            'r_squared': r_squared, 'adj_r_squared': adj_r_squared
        }

    # Select best degree by each criterion
    best_aic = min(degrees, key=lambda d: results[d]['aic'])
    best_aicc = min(degrees, key=lambda d: results[d]['aicc'])
    best_bic = min(degrees, key=lambda d: results[d]['bic'])
    best_adj_r2 = max(degrees, key=lambda d: results[d]['adj_r_squared'])

    return {
        'detailed': results,
        'best_aic': best_aic,
        'best_aicc': best_aicc,
        'best_bic': best_bic,
        'best_adj_r2': best_adj_r2
    }

# Example with synthetic data
np.random.seed(42)
x = np.linspace(-3, 3, 50)
y_true = 1.5 * x**2 - 0.5 * x + 2  # Quadratic
y = y_true + np.random.normal(0, 1.5, len(x))

degrees = list(range(1, 10))
ic_results = compute_information_criteria(x, y, degrees)

print("Information Criteria for Polynomial Degree Selection")
print("True function: quadratic (degree 2)")
print("=" * 80)
print(f"{'Degree':>6} | {'AIC':>10} | {'AICc':>10} | {'BIC':>10} | {'Adj R²':>10} | {'Status':>10}")
print("-" * 80)

for d in degrees:
    r = ic_results['detailed'][d]
    status = []
    if d == ic_results['best_aic']:
        status.append("AIC")
    if d == ic_results['best_bic']:
        status.append("BIC")
    if d == ic_results['best_adj_r2']:
        status.append("Adj-R²")
    status_str = ", ".join(status)
    print(f"{d:>6} | {r['aic']:>10.2f} | {r['aicc']:>10.2f} | "
          f"{r['bic']:>10.2f} | {r['adj_r_squared']:>10.4f} | {status_str:>10}")

print("\nRecommendations:")
print(f"  AIC: degree {ic_results['best_aic']}")
print(f"  AICc: degree {ic_results['best_aicc']}")
print(f"  BIC: degree {ic_results['best_bic']}")
print(f"  Adj R²: degree {ic_results['best_adj_r2']}")
```

Use AIC when prediction accuracy is paramount—you want the model that predicts best on new data, even if it's slightly more complex than necessary. Use BIC when you believe there's a 'true' underlying model and want to identify it—BIC is consistent (it converges to the true model as $n \to \infty$). When AIC and BIC disagree significantly, examine both models carefully.
The F-Test for Nested Models:
When comparing polynomial models of different degrees, we're comparing nested models—a degree-$d$ polynomial is nested within a degree-$(d+k)$ polynomial. The F-test provides a formal hypothesis test for whether the additional terms improve fit significantly.
Setup:
Null Hypothesis: The extra polynomial terms have zero coefficients (reduced model is sufficient).
F-Statistic: $$F = \frac{(\text{RSS}_1 - \text{RSS}_2) / (p_2 - p_1)}{\text{RSS}_2 / (n - p_2)}$$
Under the null hypothesis, $F \sim F_{p_2 - p_1, n - p_2}$ (F-distribution).
```python
import numpy as np
from scipy import stats

def f_test_polynomial(x: np.ndarray, y: np.ndarray,
                      degree_reduced: int, degree_full: int,
                      alpha: float = 0.05) -> dict:
    """
    Perform F-test comparing two nested polynomial models.

    Parameters:
        x, y: Data arrays
        degree_reduced: Degree of simpler model
        degree_full: Degree of more complex model
        alpha: Significance level

    Returns:
        Dictionary with test statistics and decision
    """
    if degree_reduced >= degree_full:
        raise ValueError("degree_reduced must be less than degree_full")

    n = len(y)

    # Center and scale
    x_mean, x_std = np.mean(x), np.std(x)
    if x_std < 1e-10:
        x_std = 1.0
    z = (x - x_mean) / x_std

    # Fit reduced model
    Phi_reduced = np.column_stack([z**k for k in range(degree_reduced + 1)])
    beta_reduced = np.linalg.lstsq(Phi_reduced, y, rcond=None)[0]
    rss_reduced = np.sum((y - Phi_reduced @ beta_reduced)**2)
    p_reduced = degree_reduced + 1

    # Fit full model
    Phi_full = np.column_stack([z**k for k in range(degree_full + 1)])
    beta_full = np.linalg.lstsq(Phi_full, y, rcond=None)[0]
    rss_full = np.sum((y - Phi_full @ beta_full)**2)
    p_full = degree_full + 1

    # F-statistic
    df_num = p_full - p_reduced  # Extra parameters in full model
    df_den = n - p_full          # Residual degrees of freedom
    f_stat = ((rss_reduced - rss_full) / df_num) / (rss_full / df_den)

    p_value = 1 - stats.f.cdf(f_stat, df_num, df_den)
    f_critical = stats.f.ppf(1 - alpha, df_num, df_den)
    reject_null = p_value < alpha

    return {
        'f_statistic': f_stat,
        'p_value': p_value,
        'f_critical': f_critical,
        'df_numerator': df_num,
        'df_denominator': df_den,
        'rss_reduced': rss_reduced,
        'rss_full': rss_full,
        'reject_null': reject_null,
        'conclusion': ('Full model significantly better' if reject_null
                       else 'Reduced model sufficient')
    }

def sequential_f_tests(x: np.ndarray, y: np.ndarray, max_degree: int,
                       alpha: float = 0.05) -> int:
    """
    Use sequential F-tests to select polynomial degree.

    Start with degree 1, keep adding terms until the F-test fails.
    """
    print(f"Sequential F-Tests for Polynomial Degree Selection (α = {alpha})")
    print("=" * 70)
    print(f"{'Comparison':>20} | {'F-stat':>10} | {'p-value':>10} | {'Decision':>20}")
    print("-" * 70)

    selected_degree = 1
    for d in range(1, max_degree):
        result = f_test_polynomial(x, y, d, d + 1, alpha)
        comparison = f"degree {d} vs {d+1}"
        decision = "Add term" if result['reject_null'] else "Stop here"
        print(f"{comparison:>20} | {result['f_statistic']:>10.4f} | "
              f"{result['p_value']:>10.4f} | {decision:>20}")
        if result['reject_null']:
            selected_degree = d + 1
        else:
            break  # Higher terms not needed

    print(f"\nSelected degree: {selected_degree}")
    return selected_degree

# Example
np.random.seed(42)
x = np.linspace(-3, 3, 80)
y_true = 0.5 * x**3 - x**2 + 2  # Cubic
y = y_true + np.random.normal(0, 1.5, len(x))

selected = sequential_f_tests(x, y, max_degree=8)
```

Sequential F-tests suffer from multiple testing issues—if you perform many tests, the overall type I error rate exceeds $\alpha$. For formal model selection, information criteria (AIC/BIC) or cross-validation are often preferred. F-tests are most useful for testing specific hypotheses (e.g., 'Is the quadratic term necessary?').
Beyond formal methods, experienced practitioners develop intuition for polynomial degree selection. Some widely used guidelines:

- Start simple: fit degrees 1-2 first and increase only when residuals show systematic curvature.
- Degrees above 5 or 6 are rarely justified for a single predictor; if more flexibility is needed, splines or regularization usually behave better.
- Keep a sample-size cushion: roughly ten or more observations per estimated coefficient, so degree $d$ wants $n \gtrsim 10(d+1)$.
- Always center and scale $x$ before forming powers (as the code on this page does) to avoid ill-conditioned design matrices.
- Inspect the fitted curve between and beyond the data points: wild oscillations, especially near the edges of the $x$ range, are a classic sign the degree is too high.
Let's work through a complete degree selection exercise using all the methods we've developed, demonstrating how to synthesize multiple lines of evidence.
```python
import numpy as np
from collections import Counter

def comprehensive_degree_selection(x: np.ndarray, y: np.ndarray,
                                   max_degree: int = 10, k_folds: int = 5) -> dict:
    """
    Comprehensive polynomial degree selection using multiple methods.

    Combines: cross-validation, AIC, BIC, and adjusted R².
    """
    n = len(y)
    x_mean, x_std = np.mean(x), np.std(x)
    if x_std < 1e-10:
        x_std = 1.0
    z = (x - x_mean) / x_std

    degrees = list(range(1, min(max_degree + 1, n // 3)))  # Safety limit
    results = {'degrees': degrees, 'metrics': {}}

    for d in degrees:
        # Fit model
        Phi = np.column_stack([z**k for k in range(d + 1)])
        beta = np.linalg.lstsq(Phi, y, rcond=None)[0]
        y_pred = Phi @ beta
        residuals = y - y_pred

        # Basic metrics
        rss = np.sum(residuals**2)
        tss = np.sum((y - np.mean(y))**2)
        p = d + 2  # Parameters including variance
        r2 = 1 - rss / tss
        adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p)

        # Information criteria
        ll_term = n * np.log(rss / n)
        aic = ll_term + 2 * p
        bic = ll_term + np.log(n) * p

        # Cross-validation (simplified k-fold)
        np.random.seed(42)  # Reproducibility
        indices = np.random.permutation(n)
        fold_size = n // k_folds
        cv_errors = []
        for k in range(k_folds):
            val_idx = indices[k * fold_size:(k + 1) * fold_size]
            train_idx = np.concatenate([indices[:k * fold_size],
                                        indices[(k + 1) * fold_size:]])
            z_train, y_train = z[train_idx], y[train_idx]
            z_val, y_val = z[val_idx], y[val_idx]
            Phi_tr = np.column_stack([z_train**j for j in range(d + 1)])
            Phi_val = np.column_stack([z_val**j for j in range(d + 1)])
            beta_cv = np.linalg.lstsq(Phi_tr, y_train, rcond=None)[0]
            cv_pred = Phi_val @ beta_cv
            cv_errors.append(np.mean((y_val - cv_pred)**2))

        cv_mean = np.mean(cv_errors)
        cv_se = np.std(cv_errors) / np.sqrt(k_folds)

        results['metrics'][d] = {
            'r_squared': r2,
            'adj_r_squared': adj_r2,
            'aic': aic,
            'bic': bic,
            'cv_error': cv_mean,
            'cv_se': cv_se,
            'coefficients': beta
        }

    # Recommendations
    metrics = results['metrics']
    best_cv = min(degrees, key=lambda d: metrics[d]['cv_error'])
    best_aic = min(degrees, key=lambda d: metrics[d]['aic'])
    best_bic = min(degrees, key=lambda d: metrics[d]['bic'])
    best_adj_r2 = max(degrees, key=lambda d: metrics[d]['adj_r_squared'])

    # One-SE rule
    cv_threshold = metrics[best_cv]['cv_error'] + metrics[best_cv]['cv_se']
    one_se = min(d for d in degrees if metrics[d]['cv_error'] <= cv_threshold)

    results['recommendations'] = {
        'cv_minimum': best_cv,
        'cv_one_se': one_se,
        'aic': best_aic,
        'bic': best_bic,
        'adj_r_squared': best_adj_r2
    }

    # Consensus (most common recommendation)
    recs = list(results['recommendations'].values())
    results['consensus'] = Counter(recs).most_common(1)[0][0]

    return results

def print_selection_report(results: dict):
    """Pretty-print the degree selection analysis."""
    print("\n" + "=" * 80)
    print("POLYNOMIAL DEGREE SELECTION ANALYSIS")
    print("=" * 80)

    # Metrics table
    print("\n--- Performance Metrics by Degree ---")
    print(f"{'Degree':>6} | {'R²':>8} | {'Adj R²':>8} | {'AIC':>10} | "
          f"{'BIC':>10} | {'CV Error':>10} | {'CV SE':>8}")
    print("-" * 80)
    for d in results['degrees']:
        m = results['metrics'][d]
        print(f"{d:>6} | {m['r_squared']:>8.4f} | {m['adj_r_squared']:>8.4f} | "
              f"{m['aic']:>10.2f} | {m['bic']:>10.2f} | "
              f"{m['cv_error']:>10.4f} | {m['cv_se']:>8.4f}")

    # Recommendations
    print("\n--- Method Recommendations ---")
    recs = results['recommendations']
    print(f"  Cross-validation (minimum): degree {recs['cv_minimum']}")
    print(f"  Cross-validation (1-SE): degree {recs['cv_one_se']}")
    print(f"  AIC: degree {recs['aic']}")
    print(f"  BIC: degree {recs['bic']}")
    print(f"  Adjusted R²: degree {recs['adj_r_squared']}")
    print(f"\n  >>> CONSENSUS: degree {results['consensus']} <<<")
    print("=" * 80)

# Application to real-world-like data
np.random.seed(123)
x = np.linspace(0, 10, 80)
# True relationship: quadratic with diminishing returns
y_true = 5 + 3*x - 0.15*x**2
y = y_true + np.random.normal(0, 1.5, len(x))

results = comprehensive_degree_selection(x, y, max_degree=8)
print_selection_report(results)
```

When methods agree (all recommending degree 2 for truly quadratic data), you can be confident in your choice. When they disagree, investigate why—perhaps the data has features that different criteria weight differently. The consensus approach provides a robust default.
We've developed a comprehensive toolkit for polynomial degree selection. To consolidate: the bias-variance decomposition explains why an optimal degree exists; cross-validation (with the one-standard-error rule) estimates generalization error directly; AIC and BIC trade fit against complexity without refitting; F-tests formally compare nested models; and practical heuristics keep the search grounded.
What's Next:
Degree selection addresses overfitting through model simplification. But high-degree polynomials remain useful when paired with regularization—adding penalty terms that shrink coefficients. The next page explores overfitting in polynomial regression more deeply and shows how regularization (Ridge, Lasso) provides an alternative to degree selection.
You now have a complete toolkit for polynomial degree selection: bias-variance analysis, cross-validation, information criteria, hypothesis testing, and practical guidelines. These methods apply broadly to any model selection problem involving complexity parameters.