In polynomial regression, the degree $d$ is the most critical hyperparameter. It controls the model's capacity to capture complex patterns—but as with Goldilocks, we need the choice to be 'just right.'
Consider fitting a polynomial to noisy data: too low a degree misses the underlying trend (underfitting), while too high a degree chases the noise (overfitting).
This page develops rigorous, principled methods for finding that optimal degree—moving beyond intuition to systematic model selection.
By the end of this page, you will master: (1) Cross-validation for degree selection with proper implementation; (2) Information criteria (AIC, BIC) and their mathematical foundations; (3) Hypothesis testing for comparing polynomial models; (4) The bias-variance decomposition applied to polynomial degree; (5) Practical heuristics and rules of thumb.
The Expected Prediction Error:
For a new test point $x_0$ with true value $y_0 = f(x_0) + \epsilon$ (where $\epsilon$ is noise with variance $\sigma^2$), the expected squared error can be decomposed:
$$\mathbb{E}[(y_0 - \hat{f}(x_0))^2] = \text{Bias}^2[\hat{f}(x_0)] + \text{Var}[\hat{f}(x_0)] + \sigma^2$$
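This decomposition follows from a short, standard expansion. Writing $\hat{f} = \hat{f}(x_0)$ and $f = f(x_0)$, and using that the noise $\epsilon$ is independent of the fitted model with $\mathbb{E}[\epsilon] = 0$:

$$\begin{aligned}
\mathbb{E}[(y_0 - \hat{f})^2] &= \mathbb{E}[(f + \epsilon - \hat{f})^2] \\
&= \mathbb{E}[(f - \hat{f})^2] + \sigma^2 \\
&= \left(f - \mathbb{E}[\hat{f}]\right)^2 + \mathbb{E}\left[(\hat{f} - \mathbb{E}[\hat{f}])^2\right] + \sigma^2 \\
&= \text{Bias}^2[\hat{f}] + \text{Var}[\hat{f}] + \sigma^2
\end{aligned}$$

The cross term in the last expansion vanishes because $\mathbb{E}\big[(f - \mathbb{E}[\hat{f}])(\mathbb{E}[\hat{f}] - \hat{f})\big] = 0$.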
As polynomial degree $d$ increases:
| Component | Effect of Increasing $d$ |
|---|---|
| Bias² | Decreases (more flexibility to match true $f$) |
| Variance | Increases (more parameters → more sensitivity to training noise) |
| Irreducible error $\sigma^2$ | Unchanged (inherent noise) |
The optimal degree minimizes the sum Bias² + Variance, not either component alone.
```python
import numpy as np
from typing import Tuple, List

def bias_variance_simulation(true_func, x_range: Tuple[float, float],
                             n_train: int, noise_std: float,
                             degrees: List[int], n_simulations: int = 200,
                             n_test: int = 50) -> dict:
    """
    Empirically estimate bias and variance for different polynomial degrees.

    Process:
    1. Generate many training datasets from the same true function + noise
    2. Fit each polynomial degree to each dataset
    3. Evaluate predictions at fixed test points
    4. Compute empirical bias and variance across simulations
    """
    x_min, x_max = x_range

    # Fixed test points for evaluation
    x_test = np.linspace(x_min, x_max, n_test)
    y_true = true_func(x_test)

    results = {d: {'predictions': np.zeros((n_simulations, n_test))}
               for d in degrees}

    for sim in range(n_simulations):
        # Generate training data
        x_train = np.random.uniform(x_min, x_max, n_train)
        y_train = true_func(x_train) + np.random.normal(0, noise_std, n_train)

        # Fit each degree
        for d in degrees:
            # Center and scale
            x_mean, x_std = np.mean(x_train), np.std(x_train)
            if x_std < 1e-10:
                x_std = 1.0
            z_train = (x_train - x_mean) / x_std
            z_test = (x_test - x_mean) / x_std

            # Fit polynomial
            Phi_train = np.column_stack([z_train**k for k in range(d + 1)])
            Phi_test = np.column_stack([z_test**k for k in range(d + 1)])
            try:
                beta = np.linalg.lstsq(Phi_train, y_train, rcond=None)[0]
                y_pred = Phi_test @ beta
            except np.linalg.LinAlgError:
                y_pred = np.full(n_test, np.nan)

            results[d]['predictions'][sim, :] = y_pred

    # Compute bias and variance at each test point
    for d in degrees:
        preds = results[d]['predictions']
        mean_pred = np.nanmean(preds, axis=0)  # E[f_hat(x)]

        # Bias = E[f_hat(x)] - f(x)
        bias = mean_pred - y_true
        bias_squared = np.nanmean(bias**2)  # Average over test points

        # Variance = E[(f_hat(x) - E[f_hat(x)])^2]
        variance = np.nanmean(np.nanvar(preds, axis=0))

        # Total expected error (averaged over simulations and test points)
        mse = np.nanmean((preds - y_true)**2)

        results[d]['bias_squared'] = bias_squared
        results[d]['variance'] = variance
        results[d]['mse'] = mse
        results[d]['irreducible'] = noise_std**2

    return results

# True function: cubic
true_func = lambda x: 0.5 * x**3 - 2 * x**2 + x + 1

# Run simulation
degrees = [1, 2, 3, 4, 5, 7, 10, 15]
results = bias_variance_simulation(
    true_func, x_range=(-2, 3), n_train=30, noise_std=1.0,
    degrees=degrees, n_simulations=200)

# Display results
print("Bias-Variance Decomposition by Polynomial Degree")
print("True function: 0.5x³ - 2x² + x + 1 (cubic)")
print("=" * 70)
print(f"{'Degree':>8} | {'Bias²':>12} | {'Variance':>12} | {'MSE':>12} | {'Optimal?':>10}")
print("-" * 70)

min_mse = min(r['mse'] for r in results.values())
for d in degrees:
    r = results[d]
    is_optimal = " ✓" if abs(r['mse'] - min_mse) < 0.1 else ""
    print(f"{d:>8} | {r['bias_squared']:>12.4f} | {r['variance']:>12.4f} | "
          f"{r['mse']:>12.4f} | {is_optimal:>10}")
```

In the simulation above, the true function is cubic (degree 3). Notice that degrees 1-2 have high bias (they can't capture the cubic shape), degrees 3-4 have optimal MSE (low bias, variance still controlled), and degrees 7+ show exploding variance despite near-zero bias. The optimal degree captures the true complexity without inviting unnecessary variance.
The Core Idea:
Cross-validation (CV) estimates how well a model generalizes to unseen data by systematically holding out portions of the training data for validation. For polynomial degree selection, we fit polynomials of different degrees and choose the one with lowest CV error.
K-Fold Cross-Validation:
The Mathematical Formulation:
Let $\kappa: \{1, \ldots, n\} \to \{1, \ldots, K\}$ be the fold assignment. The CV error for degree $d$ is:
$$\text{CV}_K(d) = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{f}_{-\kappa(i)}^{(d)}(x_i) \right)^2$$
where $\hat{f}_{-\kappa(i)}^{(d)}$ is the degree-$d$ polynomial fitted on all data except fold $\kappa(i)$.
```python
import numpy as np
from typing import List

def k_fold_cv_polynomial(x: np.ndarray, y: np.ndarray, degrees: List[int],
                         k: int = 5, random_state: int = None) -> dict:
    """
    Perform K-fold cross-validation to select optimal polynomial degree.

    Parameters:
        x, y: Training data
        degrees: List of polynomial degrees to evaluate
        k: Number of folds
        random_state: Random seed for reproducibility

    Returns:
        Dictionary with CV errors, standard errors, and best degree
    """
    n = len(x)
    if random_state is not None:
        np.random.seed(random_state)

    # Random fold assignment
    indices = np.random.permutation(n)
    fold_sizes = np.full(k, n // k)
    fold_sizes[:n % k] += 1  # Distribute remainder
    fold_indices = []
    current = 0
    for size in fold_sizes:
        fold_indices.append(indices[current:current + size])
        current += size

    results = {d: {'fold_errors': []} for d in degrees}

    for d in degrees:
        for fold_idx in range(k):
            # Train on all folds except fold_idx
            val_idx = fold_indices[fold_idx]
            train_idx = np.concatenate([fold_indices[i]
                                        for i in range(k) if i != fold_idx])

            x_train, y_train = x[train_idx], y[train_idx]
            x_val, y_val = x[val_idx], y[val_idx]

            # Center and scale based on training data only
            x_mean, x_std = np.mean(x_train), np.std(x_train)
            if x_std < 1e-10:
                x_std = 1.0
            z_train = (x_train - x_mean) / x_std
            z_val = (x_val - x_mean) / x_std

            # Fit polynomial
            Phi_train = np.column_stack([z_train**j for j in range(d + 1)])
            Phi_val = np.column_stack([z_val**j for j in range(d + 1)])
            try:
                beta = np.linalg.lstsq(Phi_train, y_train, rcond=None)[0]
                y_pred = Phi_val @ beta
                mse = np.mean((y_val - y_pred)**2)
            except np.linalg.LinAlgError:
                mse = np.inf

            results[d]['fold_errors'].append(mse)

        # Aggregate across folds
        fold_errors = results[d]['fold_errors']
        results[d]['cv_error'] = np.mean(fold_errors)
        results[d]['cv_std'] = np.std(fold_errors) / np.sqrt(k)  # Std error of mean

    # Find optimal degree
    cv_errors = {d: results[d]['cv_error'] for d in degrees}
    best_degree = min(cv_errors, key=cv_errors.get)

    # One-standard-error rule: simplest model within 1 SE of best
    best_cv_error = cv_errors[best_degree]
    threshold = best_cv_error + results[best_degree]['cv_std']
    one_se_degree = min(d for d in sorted(degrees) if cv_errors[d] <= threshold)

    return {
        'detailed': results,
        'best_degree': best_degree,
        'one_se_degree': one_se_degree,
        'cv_errors': cv_errors
    }

# Example usage
np.random.seed(42)
x = np.linspace(-3, 3, 60)
y_true = 0.3 * x**3 - 0.8 * x**2 + 0.5 * x + 2
y = y_true + np.random.normal(0, 1.5, len(x))

degrees = list(range(1, 12))
cv_results = k_fold_cv_polynomial(x, y, degrees, k=5, random_state=42)

print("K-Fold Cross-Validation Results (K=5)")
print("True function: cubic")
print("=" * 60)
print(f"{'Degree':>8} | {'CV Error':>12} | {'± Std Err':>12} | {'Status':>12}")
print("-" * 60)

for d in degrees:
    r = cv_results['detailed'][d]
    status = ""
    if d == cv_results['best_degree']:
        status = "BEST"
    elif d == cv_results['one_se_degree']:
        status = "1-SE RULE"
    print(f"{d:>8} | {r['cv_error']:>12.4f} | {r['cv_std']:>12.4f} | {status:>12}")

print(f"\nBest degree (minimum CV): {cv_results['best_degree']}")
print(f"One-SE rule degree: {cv_results['one_se_degree']}")
```

The one-standard-error rule is a powerful principle for model selection: choose the simplest model whose CV error is within one standard error of the minimum. This guards against overfitting to the validation sets themselves and typically yields more robust generalization. For polynomials, this often selects a degree 1-2 lower than the pure minimum CV choice.
Leave-One-Out Cross-Validation (LOOCV):
When $K = n$ (each fold contains exactly one observation), we get LOOCV. For polynomial regression specifically, LOOCV has a remarkable shortcut—we don't need to refit $n$ times!
The PRESS Statistic (Predicted Residual Error Sum of Squares):
$$\text{PRESS} = \sum_{i=1}^{n} \left( \frac{e_i}{1 - h_{ii}} \right)^2$$
where $e_i = y_i - \hat{y}_i$ is the residual and $h_{ii}$ is the $i$-th diagonal element of the hat matrix $\mathbf{H} = \mathbf{\Phi}(\mathbf{\Phi}^T \mathbf{\Phi})^{-1}\mathbf{\Phi}^T$.
This formula gives exact LOOCV error from a single model fit!
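As a sketch of the shortcut in action (the helper name `press_loocv` is my own, not from this page's code), the PRESS formula can be checked against brute-force LOOCV, which refits the model $n$ times:

```python
import numpy as np

def press_loocv(x: np.ndarray, y: np.ndarray, d: int) -> float:
    """Exact LOOCV mean squared error for a degree-d polynomial via PRESS."""
    z = (x - np.mean(x)) / np.std(x)  # center and scale, as elsewhere on this page
    Phi = np.column_stack([z**k for k in range(d + 1)])
    beta, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    e = y - Phi @ beta  # ordinary residuals from a single fit
    # Leverages h_ii: diagonal of H = Phi (Phi^T Phi)^{-1} Phi^T
    G = np.linalg.pinv(Phi.T @ Phi)
    h = np.sum((Phi @ G) * Phi, axis=1)
    return float(np.mean((e / (1 - h))**2))  # PRESS / n

# Sanity check against brute-force LOOCV (n separate refits)
rng = np.random.default_rng(0)
x = np.linspace(-2, 2, 25)
y = x**2 + rng.normal(0, 0.5, len(x))

z = (x - np.mean(x)) / np.std(x)  # same fixed design as the shortcut
Phi = np.column_stack([z**k for k in range(3)])
brute = []
for i in range(len(x)):
    mask = np.arange(len(x)) != i
    b, *_ = np.linalg.lstsq(Phi[mask], y[mask], rcond=None)
    brute.append((y[i] - Phi[i] @ b)**2)

print(np.isclose(press_loocv(x, y, 2), np.mean(brute)))  # the identity is exact: True
```

Note that the identity holds for a fixed design matrix, which is why the brute-force loop reuses the standardization computed on the full sample.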
The Core Idea:
Information criteria provide a principled way to balance model fit against model complexity, without the computational cost of cross-validation. They estimate out-of-sample prediction error by adding a penalty for the number of parameters.
Akaike Information Criterion (AIC):
$$\text{AIC} = n \cdot \ln(\text{RSS}/n) + 2(d+2)$$
where $\text{RSS} = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2$ is the residual sum of squares and $(d+2)$ is the number of parameters (including variance estimation).
Bayesian Information Criterion (BIC):
$$\text{BIC} = n \cdot \ln(\text{RSS}/n) + \ln(n) \cdot (d+2)$$
BIC penalizes complexity more heavily (especially for large $n$), favoring simpler models.
| Criterion | Penalty Term | Tendency | Best For |
|---|---|---|---|
| AIC | $2p$ | Slightly favors complex models | Prediction accuracy |
| BIC | $\ln(n) \cdot p$ | Strongly favors simpler models | Finding true model |
| AICc | $\frac{2p(p+1)}{n-p-1}$ correction | Corrects AIC for small samples | Small $n$ settings |
```python
import numpy as np
from typing import List

def compute_information_criteria(x: np.ndarray, y: np.ndarray,
                                 degrees: List[int]) -> dict:
    """
    Compute AIC, AICc, and BIC for polynomial models of various degrees.

    Returns dictionary with all criteria values and selected degrees.
    """
    n = len(y)

    # Center and scale
    x_mean, x_std = np.mean(x), np.std(x)
    if x_std < 1e-10:
        x_std = 1.0
    z = (x - x_mean) / x_std

    results = {}

    for d in degrees:
        # Fit model
        Phi = np.column_stack([z**k for k in range(d + 1)])
        beta = np.linalg.lstsq(Phi, y, rcond=None)[0]
        y_pred = Phi @ beta

        # Residual sum of squares
        rss = np.sum((y - y_pred)**2)

        # Number of parameters (coefficients + variance)
        p = d + 2  # d+1 coefficients + 1 variance parameter

        # Log-likelihood term (proportional)
        ll_term = n * np.log(rss / n)

        # AIC
        aic = ll_term + 2 * p

        # AICc (corrected AIC for small samples)
        if n - p - 1 > 0:
            aicc = aic + (2 * p * (p + 1)) / (n - p - 1)
        else:
            aicc = np.inf

        # BIC
        bic = ll_term + np.log(n) * p

        # R-squared for reference
        tss = np.sum((y - np.mean(y))**2)
        r_squared = 1 - rss / tss
        adj_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - p)

        results[d] = {
            'aic': aic, 'aicc': aicc, 'bic': bic, 'rss': rss,
            'r_squared': r_squared, 'adj_r_squared': adj_r_squared
        }

    # Select best degree by each criterion
    best_aic = min(degrees, key=lambda d: results[d]['aic'])
    best_aicc = min(degrees, key=lambda d: results[d]['aicc'])
    best_bic = min(degrees, key=lambda d: results[d]['bic'])
    best_adj_r2 = max(degrees, key=lambda d: results[d]['adj_r_squared'])

    return {
        'detailed': results,
        'best_aic': best_aic,
        'best_aicc': best_aicc,
        'best_bic': best_bic,
        'best_adj_r2': best_adj_r2
    }

# Example with synthetic data
np.random.seed(42)
x = np.linspace(-3, 3, 50)
y_true = 1.5 * x**2 - 0.5 * x + 2  # Quadratic
y = y_true + np.random.normal(0, 1.5, len(x))

degrees = list(range(1, 10))
ic_results = compute_information_criteria(x, y, degrees)

print("Information Criteria for Polynomial Degree Selection")
print("True function: quadratic (degree 2)")
print("=" * 80)
print(f"{'Degree':>6} | {'AIC':>10} | {'AICc':>10} | {'BIC':>10} | {'Adj R²':>10} | {'Status':>10}")
print("-" * 80)

for d in degrees:
    r = ic_results['detailed'][d]
    status = []
    if d == ic_results['best_aic']:
        status.append("AIC")
    if d == ic_results['best_bic']:
        status.append("BIC")
    if d == ic_results['best_adj_r2']:
        status.append("Adj-R²")
    status_str = ", ".join(status)
    print(f"{d:>6} | {r['aic']:>10.2f} | {r['aicc']:>10.2f} | "
          f"{r['bic']:>10.2f} | {r['adj_r_squared']:>10.4f} | {status_str:>10}")

print("\nRecommendations:")
print(f"  AIC: degree {ic_results['best_aic']}")
print(f"  AICc: degree {ic_results['best_aicc']}")
print(f"  BIC: degree {ic_results['best_bic']}")
print(f"  Adj R²: degree {ic_results['best_adj_r2']}")
```

Use AIC when prediction accuracy is paramount—you want the model that predicts best on new data, even if it's slightly more complex than necessary. Use BIC when you believe there's a 'true' underlying model and want to identify it—BIC is consistent (it converges to the true model as $n \to \infty$). When AIC and BIC disagree significantly, examine both models carefully.
The F-Test for Nested Models:
When comparing polynomial models of different degrees, we're comparing nested models—a degree-$d$ polynomial is nested within a degree-$(d+k)$ polynomial. The F-test provides a formal hypothesis test for whether the additional terms improve fit significantly.
Setup:
Null Hypothesis: The extra polynomial terms have zero coefficients (reduced model is sufficient).
F-Statistic: $$F = \frac{(\text{RSS}_1 - \text{RSS}_2) / (p_2 - p_1)}{\text{RSS}_2 / (n - p_2)}$$
Under the null hypothesis, $F \sim F_{p_2 - p_1, n - p_2}$ (F-distribution).
```python
import numpy as np
from scipy import stats

def f_test_polynomial(x: np.ndarray, y: np.ndarray,
                      degree_reduced: int, degree_full: int,
                      alpha: float = 0.05) -> dict:
    """
    Perform F-test comparing two nested polynomial models.

    Parameters:
        x, y: Data arrays
        degree_reduced: Degree of simpler model
        degree_full: Degree of more complex model
        alpha: Significance level

    Returns:
        Dictionary with test statistics and decision
    """
    if degree_reduced >= degree_full:
        raise ValueError("degree_reduced must be less than degree_full")

    n = len(y)

    # Center and scale
    x_mean, x_std = np.mean(x), np.std(x)
    if x_std < 1e-10:
        x_std = 1.0
    z = (x - x_mean) / x_std

    # Fit reduced model
    Phi_reduced = np.column_stack([z**k for k in range(degree_reduced + 1)])
    beta_reduced = np.linalg.lstsq(Phi_reduced, y, rcond=None)[0]
    rss_reduced = np.sum((y - Phi_reduced @ beta_reduced)**2)
    p_reduced = degree_reduced + 1

    # Fit full model
    Phi_full = np.column_stack([z**k for k in range(degree_full + 1)])
    beta_full = np.linalg.lstsq(Phi_full, y, rcond=None)[0]
    rss_full = np.sum((y - Phi_full @ beta_full)**2)
    p_full = degree_full + 1

    # F-statistic
    df_num = p_full - p_reduced  # Extra parameters in full model
    df_den = n - p_full          # Residual degrees of freedom
    f_stat = ((rss_reduced - rss_full) / df_num) / (rss_full / df_den)

    p_value = 1 - stats.f.cdf(f_stat, df_num, df_den)
    f_critical = stats.f.ppf(1 - alpha, df_num, df_den)
    reject_null = p_value < alpha

    return {
        'f_statistic': f_stat,
        'p_value': p_value,
        'f_critical': f_critical,
        'df_numerator': df_num,
        'df_denominator': df_den,
        'rss_reduced': rss_reduced,
        'rss_full': rss_full,
        'reject_null': reject_null,
        'conclusion': ('Full model significantly better' if reject_null
                       else 'Reduced model sufficient')
    }

def sequential_f_tests(x: np.ndarray, y: np.ndarray, max_degree: int,
                       alpha: float = 0.05) -> int:
    """
    Use sequential F-tests to select polynomial degree.

    Start with degree 1, keep adding terms until the F-test fails.
    """
    print(f"Sequential F-Tests for Polynomial Degree Selection (α = {alpha})")
    print("=" * 70)
    print(f"{'Comparison':>20} | {'F-stat':>10} | {'p-value':>10} | {'Decision':>20}")
    print("-" * 70)

    selected_degree = 1
    for d in range(1, max_degree):
        result = f_test_polynomial(x, y, d, d + 1, alpha)
        comparison = f"degree {d} vs {d+1}"
        decision = "Add term" if result['reject_null'] else "Stop here"
        print(f"{comparison:>20} | {result['f_statistic']:>10.4f} | "
              f"{result['p_value']:>10.4f} | {decision:>20}")
        if result['reject_null']:
            selected_degree = d + 1
        else:
            break  # Higher terms not needed

    print(f"\nSelected degree: {selected_degree}")
    return selected_degree

# Example
np.random.seed(42)
x = np.linspace(-3, 3, 80)
y_true = 0.5 * x**3 - x**2 + 2  # Cubic
y = y_true + np.random.normal(0, 1.5, len(x))

selected = sequential_f_tests(x, y, max_degree=8)
```

Sequential F-tests suffer from multiple testing issues—if you perform many tests, the overall type I error rate exceeds $\alpha$. For formal model selection, information criteria (AIC/BIC) or cross-validation are often preferred. F-tests are most useful for testing specific hypotheses (e.g., 'Is the quadratic term necessary?').
Beyond formal methods, experienced practitioners develop intuition for polynomial degree selection. Some widely used guidelines:

- Start simple: fit degrees 1-2 first and increase only when residuals show systematic curvature.
- Degrees above 5 or 6 are rarely justified for a single predictor; if more flexibility is needed, splines or regularization usually behave better.
- Keep a sample-size cushion: roughly ten or more observations per estimated coefficient, so degree $d$ wants $n \gtrsim 10(d+1)$.
- Always center and scale $x$ before forming powers (as the code on this page does) to avoid ill-conditioned design matrices.
- Inspect the fitted curve between and beyond the data points: wild oscillations, especially near the edges of the $x$ range, are a classic sign the degree is too high.
Let's work through a complete degree selection exercise using all the methods we've developed, demonstrating how to synthesize multiple lines of evidence.
```python
import numpy as np
from collections import Counter

def comprehensive_degree_selection(x: np.ndarray, y: np.ndarray,
                                   max_degree: int = 10, k_folds: int = 5) -> dict:
    """
    Comprehensive polynomial degree selection using multiple methods.

    Combines: cross-validation, AIC, BIC, and adjusted R².
    """
    n = len(y)
    x_mean, x_std = np.mean(x), np.std(x)
    if x_std < 1e-10:
        x_std = 1.0
    z = (x - x_mean) / x_std

    degrees = list(range(1, min(max_degree + 1, n // 3)))  # Safety limit
    results = {'degrees': degrees, 'metrics': {}}

    for d in degrees:
        # Fit model
        Phi = np.column_stack([z**k for k in range(d + 1)])
        beta = np.linalg.lstsq(Phi, y, rcond=None)[0]
        y_pred = Phi @ beta
        residuals = y - y_pred

        # Basic metrics
        rss = np.sum(residuals**2)
        tss = np.sum((y - np.mean(y))**2)
        p = d + 2  # Parameters including variance
        r2 = 1 - rss / tss
        adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p)

        # Information criteria
        ll_term = n * np.log(rss / n)
        aic = ll_term + 2 * p
        bic = ll_term + np.log(n) * p

        # Cross-validation (simplified k-fold)
        np.random.seed(42)  # Reproducibility
        indices = np.random.permutation(n)
        fold_size = n // k_folds
        cv_errors = []
        for k in range(k_folds):
            val_idx = indices[k * fold_size:(k + 1) * fold_size]
            train_idx = np.concatenate([indices[:k * fold_size],
                                        indices[(k + 1) * fold_size:]])
            z_train, y_train = z[train_idx], y[train_idx]
            z_val, y_val = z[val_idx], y[val_idx]
            Phi_tr = np.column_stack([z_train**j for j in range(d + 1)])
            Phi_val = np.column_stack([z_val**j for j in range(d + 1)])
            beta_cv = np.linalg.lstsq(Phi_tr, y_train, rcond=None)[0]
            cv_pred = Phi_val @ beta_cv
            cv_errors.append(np.mean((y_val - cv_pred)**2))

        cv_mean = np.mean(cv_errors)
        cv_se = np.std(cv_errors) / np.sqrt(k_folds)

        results['metrics'][d] = {
            'r_squared': r2,
            'adj_r_squared': adj_r2,
            'aic': aic,
            'bic': bic,
            'cv_error': cv_mean,
            'cv_se': cv_se,
            'coefficients': beta
        }

    # Recommendations
    metrics = results['metrics']
    best_cv = min(degrees, key=lambda d: metrics[d]['cv_error'])
    best_aic = min(degrees, key=lambda d: metrics[d]['aic'])
    best_bic = min(degrees, key=lambda d: metrics[d]['bic'])
    best_adj_r2 = max(degrees, key=lambda d: metrics[d]['adj_r_squared'])

    # One-SE rule
    cv_threshold = metrics[best_cv]['cv_error'] + metrics[best_cv]['cv_se']
    one_se = min(d for d in degrees if metrics[d]['cv_error'] <= cv_threshold)

    results['recommendations'] = {
        'cv_minimum': best_cv,
        'cv_one_se': one_se,
        'aic': best_aic,
        'bic': best_bic,
        'adj_r_squared': best_adj_r2
    }

    # Consensus (most common recommendation)
    recs = list(results['recommendations'].values())
    results['consensus'] = Counter(recs).most_common(1)[0][0]

    return results

def print_selection_report(results: dict):
    """Pretty-print the degree selection analysis."""
    print("\n" + "=" * 80)
    print("POLYNOMIAL DEGREE SELECTION ANALYSIS")
    print("=" * 80)

    # Metrics table
    print("\n--- Performance Metrics by Degree ---")
    print(f"{'Degree':>6} | {'R²':>8} | {'Adj R²':>8} | {'AIC':>10} | "
          f"{'BIC':>10} | {'CV Error':>10} | {'CV SE':>8}")
    print("-" * 80)
    for d in results['degrees']:
        m = results['metrics'][d]
        print(f"{d:>6} | {m['r_squared']:>8.4f} | {m['adj_r_squared']:>8.4f} | "
              f"{m['aic']:>10.2f} | {m['bic']:>10.2f} | "
              f"{m['cv_error']:>10.4f} | {m['cv_se']:>8.4f}")

    # Recommendations
    print("\n--- Method Recommendations ---")
    recs = results['recommendations']
    print(f"  Cross-validation (minimum): degree {recs['cv_minimum']}")
    print(f"  Cross-validation (1-SE): degree {recs['cv_one_se']}")
    print(f"  AIC: degree {recs['aic']}")
    print(f"  BIC: degree {recs['bic']}")
    print(f"  Adjusted R²: degree {recs['adj_r_squared']}")
    print(f"\n  >>> CONSENSUS: degree {results['consensus']} <<<")
    print("=" * 80)

# Application to real-world-like data
np.random.seed(123)
x = np.linspace(0, 10, 80)
# True relationship: quadratic with diminishing returns
y_true = 5 + 3*x - 0.15*x**2
y = y_true + np.random.normal(0, 1.5, len(x))

results = comprehensive_degree_selection(x, y, max_degree=8)
print_selection_report(results)
```

When methods agree (all recommending degree 2 for truly quadratic data), you can be confident in your choice. When they disagree, investigate why—perhaps the data has features that different criteria weight differently. The consensus approach provides a robust default.
We've developed a comprehensive toolkit for polynomial degree selection. To consolidate: the bias-variance decomposition explains why an optimal degree exists; cross-validation (with the one-standard-error rule) estimates generalization error directly; AIC and BIC trade fit against complexity without refitting; F-tests formally compare nested models; and practical heuristics keep the search grounded.
What's Next:
Degree selection addresses overfitting through model simplification. But high-degree polynomials remain useful when paired with regularization—adding penalty terms that shrink coefficients. The next page explores overfitting in polynomial regression more deeply and shows how regularization (Ridge, Lasso) provides an alternative to degree selection.
You now have a complete toolkit for polynomial degree selection: bias-variance analysis, cross-validation, information criteria, hypothesis testing, and practical guidelines. These methods apply broadly to any model selection problem involving complexity parameters.