In nonparametric regression, the bandwidth (also called smoothing parameter, window width, or span) is the single most important hyperparameter. It controls the bias-variance tradeoff that defines the estimator's behavior: a small bandwidth yields a flexible, wiggly fit with low bias but high variance, while a large bandwidth oversmooths, trading variance for bias.
Unlike parametric models where the model itself is the main choice, nonparametric regression essentially has one degree of freedom—the bandwidth. Getting it right is everything.
This page develops principled methods for bandwidth selection, from theoretical optimal formulas to practical data-driven procedures. We'll see that while no method is perfect, understanding the landscape of options enables informed choices for any application.
By the end of this page, you will understand: (1) The theoretical optimal bandwidth and why it's impractical directly; (2) Leave-one-out and k-fold cross-validation for bandwidth selection; (3) Generalized cross-validation (GCV) as a computationally efficient alternative; (4) Plug-in bandwidth selectors; (5) Rule-of-thumb estimators; (6) Practical recommendations for different scenarios.
Mean Integrated Squared Error:
The standard criterion for bandwidth selection is the Mean Integrated Squared Error (MISE):
$$\text{MISE}(h) = E \left[ \int \left( \hat{f}(x; h) - f(x) \right)^2 dx \right]$$
This measures the expected average squared deviation of the estimator from the true function across the entire domain.
Asymptotic MISE for Nadaraya-Watson:
For the NW estimator with kernel $K$ having second moment $\mu_2$ and roughness $R(K)$:
$$\text{AMISE}(h) = \frac{h^4 \mu_2^2}{4} \int [f''(x)]^2 dx + \frac{\sigma^2 R(K)}{nh}$$
(Simplified, ignoring design effects for cleaner exposition.)
Optimal Bandwidth:
Minimizing AMISE with respect to $h$ (taking derivative, setting to zero):
$$\frac{\partial \text{AMISE}}{\partial h} = h^3 \mu_2^2 \int [f''(x)]^2 dx - \frac{\sigma^2 R(K)}{nh^2} = 0$$
Solving for $h$:
$$h_{\text{opt}} = \left( \frac{\sigma^2 R(K)}{\mu_2^2 \int [f''(x)]^2 dx} \right)^{1/5} n^{-1/5}$$
Key Insights:
The catch: We don't know $f''(x)$ or $\sigma^2$! These must be estimated, leading to plug-in methods.
The term $\int [f''(x)]^2 dx$ is called the roughness of $f$. It measures how much the function curves. Straight lines have zero roughness; wildly oscillating functions have high roughness. This roughness appears in optimal bandwidth formulas throughout nonparametric statistics.
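To make the formula concrete, here is a minimal numeric sketch (assuming $f(x) = \sin x$ on $[0, 2\pi]$, so $\int [f''(x)]^2\,dx = \int_0^{2\pi} \sin^2 x\,dx = \pi$, an assumed noise level $\sigma = 0.3$, and Epanechnikov constants $R(K) = 3/5$, $\mu_2 = 1/5$) that evaluates $h_{\text{opt}}$ and checks it against a grid minimization of the AMISE expression:

```python
import numpy as np

# Known quantities for f(x) = sin(x) on [0, 2*pi]:
# f''(x) = -sin(x), so the roughness integral is pi
roughness = np.pi
sigma2 = 0.3**2          # assumed noise variance
R_K, mu2 = 3/5, 1/5      # Epanechnikov kernel constants
n = 100

# Closed-form optimal bandwidth
h_opt = (sigma2 * R_K / (mu2**2 * roughness))**(1/5) * n**(-1/5)

# AMISE(h) = h^4 mu2^2/4 * roughness + sigma^2 R(K) / (n h)
amise = lambda h: h**4 * mu2**2 / 4 * roughness + sigma2 * R_K / (n * h)

# Verify: the closed form minimizes AMISE over a fine grid
grid = np.linspace(0.05, 2.0, 2000)
h_grid = grid[np.argmin(amise(grid))]

print(f"h_opt (closed form): {h_opt:.4f}")
print(f"h_opt (grid search): {h_grid:.4f}")
```

The two values agree to grid resolution, confirming the calculus above.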
Optimal MISE:
Substituting $h_{\text{opt}}$ back into AMISE:
$$\text{MISE}_{\text{opt}} = C(K) \cdot (\sigma^2)^{4/5} \left[ \int [f''(x)]^2 dx \right]^{1/5} n^{-4/5}$$
where $C(K)$ is a constant depending only on the kernel.
This confirms the $n^{-4/5}$ convergence rate for nonparametric regression—slower than parametric rates but achievable with proper bandwidth selection.
Leave-One-Out Cross-Validation (LOOCV):
The most straightforward data-driven approach. For each candidate bandwidth $h$:
(1) For each $i = 1, \ldots, n$, fit the estimator on all data except $(x_i, y_i)$ and compute the held-out prediction $\hat{f}_{-i}(x_i; h)$.
(2) Compute the cross-validation score: $$\text{CV}(h) = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{f}_{-i}(x_i; h) \right)^2$$
(3) Choose the $h$ minimizing $\text{CV}(h)$.
Why it works: CV(h) is an approximately unbiased estimate of the prediction error at new points. Minimizing it approximately minimizes MISE.
Efficient Computation via the Smoother Matrix:
Naïvely, LOOCV requires $n$ separate fits—expensive for large $n$. The key insight is that for linear smoothers (including kernel smoothers and local polynomial regression), we can express:
$$\hat{\mathbf{y}} = \mathbf{S} \mathbf{y}$$
where $\mathbf{S}$ is the smoother matrix (also called the hat matrix). Entry $S_{ij}$ gives the weight that observation $j$ contributes to the prediction at observation $i$.
The LOOCV shortcut:
$$\hat{f}_{-i}(x_i) = \frac{\hat{f}(x_i) - S_{ii} y_i}{1 - S_{ii}}$$
This allows computing all leave-one-out predictions from a single fit:
$$\text{CV}(h) = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{y_i - \hat{f}(x_i; h)}{1 - S_{ii}} \right)^2$$
Complexity: $O(n^2)$ to compute $\mathbf{S}$, then $O(n)$ for CV score—much better than $O(n^3)$ for naïve approach.
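The shortcut is easy to verify numerically. This sketch (assuming a Nadaraya-Watson smoother with a Gaussian kernel, so each row of $\mathbf{S}$ holds normalized kernel weights) compares brute-force leave-one-out predictions against the single-fit formula:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 40
x = np.sort(rng.uniform(0, 5, n))
y = np.sin(x) + rng.normal(0, 0.2, n)
h = 0.5

# Nadaraya-Watson smoother matrix: rows of normalized Gaussian weights
K = np.exp(-0.5 * ((x[:, None] - x[None, :]) / h)**2)
S = K / K.sum(axis=1, keepdims=True)
y_hat = S @ y

# Brute-force LOOCV: refit with point i removed
loo_brute = np.empty(n)
for i in range(n):
    w = K[i].copy()
    w[i] = 0.0
    loo_brute[i] = w @ y / w.sum()

# Single-fit shortcut: f_{-i}(x_i) = (f(x_i) - S_ii y_i) / (1 - S_ii)
d = np.diag(S)
loo_shortcut = (y_hat - d * y) / (1 - d)

print(np.allclose(loo_brute, loo_shortcut))  # True
```

The identity is exact here because each row of $\mathbf{S}$ sums to one, so deleting point $i$ and renormalizing the remaining weights is algebraically equivalent to the shortcut formula.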
```python
import numpy as np
from typing import Tuple, Callable


def compute_smoother_matrix(
    x: np.ndarray,
    h: float,
    kernel: Callable = None,
    degree: int = 1
) -> np.ndarray:
    """
    Compute the smoother matrix S such that y_hat = S @ y,
    for local polynomial regression of the given degree.
    """
    if kernel is None:
        kernel = lambda u: 0.75 * (1 - u**2) * (np.abs(u) < 1)  # Epanechnikov
    x = np.asarray(x).flatten()
    n = len(x)
    S = np.zeros((n, n))
    for j in range(n):
        x0 = x[j]
        # Kernel weights
        u = (x - x0) / h
        w = kernel(u)
        # Local design matrix centered at x0
        X_local = np.column_stack([(x - x0)**p for p in range(degree + 1)])
        # Weighted least squares: row j of S is e_1^T (X'WX)^{-1} X'W
        W = np.diag(w)
        XtWX = X_local.T @ W @ X_local
        try:
            XtWX_inv = np.linalg.inv(XtWX + 1e-10 * np.eye(degree + 1))
            e1 = np.zeros(degree + 1)
            e1[0] = 1
            S[j, :] = e1 @ XtWX_inv @ X_local.T @ W
        except np.linalg.LinAlgError:
            # Fall back to Nadaraya-Watson weights
            S[j, :] = w / (np.sum(w) + 1e-10)
    return S


def loocv_score(
    x: np.ndarray,
    y: np.ndarray,
    h: float,
    kernel: Callable = None,
    degree: int = 1
) -> float:
    """
    Leave-one-out cross-validation score for a given bandwidth,
    computed efficiently via the smoother matrix.
    """
    S = compute_smoother_matrix(x, h, kernel, degree)
    y_hat = S @ y
    # LOOCV shortcut: mean of ((y - y_hat) / (1 - S_ii))^2
    diagonal = np.diag(S)
    # Guard against division by zero when S_ii is close to 1
    diagonal = np.clip(diagonal, -np.inf, 0.999)
    residuals = y - y_hat
    loocv_residuals = residuals / (1 - diagonal)
    return np.mean(loocv_residuals**2)


def select_bandwidth_cv(
    x: np.ndarray,
    y: np.ndarray,
    h_candidates: np.ndarray,
    kernel: Callable = None,
    degree: int = 1
) -> Tuple[float, np.ndarray]:
    """
    Select bandwidth by minimizing the LOOCV score.

    Returns:
        h_opt: Optimal bandwidth
        cv_scores: CV scores for all candidates
    """
    cv_scores = np.array([
        loocv_score(x, y, h, kernel, degree) for h in h_candidates
    ])
    h_opt = h_candidates[np.argmin(cv_scores)]
    return h_opt, cv_scores


# =============================================================================
# Demonstration
# =============================================================================
np.random.seed(42)

# Generate data
n = 100
x = np.sort(np.random.uniform(0, 2*np.pi, n))
f_true = lambda x: np.sin(x)
y = f_true(x) + np.random.normal(0, 0.3, n)

# Candidate bandwidths
h_range = np.linspace(0.1, 2.0, 30)

# Run CV
h_opt, cv_scores = select_bandwidth_cv(x, y, h_range, degree=1)

print("Bandwidth Selection via LOOCV")
print("=" * 50)
print(f"Optimal bandwidth: h = {h_opt:.3f}")
print(f"Minimum CV score: {cv_scores.min():.6f}")
print("Top 5 bandwidths by CV score:")
sorted_idx = np.argsort(cv_scores)[:5]
for i, idx in enumerate(sorted_idx):
    print(f"  {i+1}. h = {h_range[idx]:.3f}, CV = {cv_scores[idx]:.6f}")
```

Cross-validation can be variable, especially for small samples. The CV curve may have multiple local minima or a flat minimum region. For more stable selection: (1) average over multiple CV splits; (2) use k-fold CV instead of LOOCV; (3) smooth the CV curve before finding the minimum; (4) consider the 'one standard error rule' (choose the largest $h$ within one SE of the minimum).
From LOOCV to GCV:
LOOCV requires computing (or accessing) the diagonal elements $S_{ii}$ of the smoother matrix. For very large $n$, even storing $\mathbf{S}$ may be prohibitive.
Generalized Cross-Validation (Craven & Wahba, 1979) approximates LOOCV by replacing each $S_{ii}$ with their average:
$$\text{GCV}(h) = \frac{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{f}(x_i))^2}{\left( 1 - \frac{\text{tr}(\mathbf{S})}{n} \right)^2}$$
where $\text{tr}(\mathbf{S}) = \sum_{i=1}^{n} S_{ii}$ is the effective degrees of freedom (EDF) of the smoother.
Interpretation:
GCV inflates training error based on model complexity, approximating true prediction error.
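A two-line toy computation (with assumed numbers, not taken from the text) shows the size of this inflation:

```python
# Assumed: training MSE of 0.08, EDF of 10, on n = 100 points
n, mse, edf = 100, 0.08, 10.0

# GCV inflates the training error by 1 / (1 - edf/n)^2
gcv = mse / (1 - edf / n)**2
print(f"GCV = {gcv:.4f}")  # 0.0988, about 23% above the raw MSE
```

The closer EDF gets to $n$, the more aggressively GCV inflates the training error.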
Effective Degrees of Freedom:
For linear smoothers, the effective degrees of freedom
$$\text{EDF}(h) = \text{tr}(\mathbf{S})$$
measures model complexity: small bandwidths give high EDF (flexible, wiggly fits), while large bandwidths give low EDF (smooth, nearly parametric fits).
For OLS regression with $p$ predictors, $\text{tr}(\mathbf{H}) = p$ (exactly). For nonparametric smoothers, EDF is continuous in $h$, providing a smooth complexity measure.
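The OLS fact is easy to verify numerically; a minimal sketch with a random design (an intercept plus $p = 4$ predictors, so $\text{tr}(\mathbf{H}) = 5$):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 4

# Design matrix with intercept: p + 1 columns in total
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])

# OLS hat matrix H = X (X'X)^{-1} X'
H = X @ np.linalg.solve(X.T @ X, X.T)

# The trace of a projection matrix equals its rank: the column count
print(round(np.trace(H), 6))  # 5.0
```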
Properties of GCV: it requires only the trace $\text{tr}(\mathbf{S})$ rather than the individual diagonal entries; it closely tracks LOOCV when the leverages $S_{ii}$ are roughly equal; and it tends to undersmooth slightly in small samples.
```python
import numpy as np
from typing import Tuple


def compute_edf(S: np.ndarray) -> float:
    """Effective degrees of freedom = trace of smoother matrix."""
    return np.trace(S)


def gcv_score(y: np.ndarray, y_hat: np.ndarray, edf: float) -> float:
    """
    Generalized Cross-Validation score: GCV = MSE / (1 - edf/n)^2.
    """
    n = len(y)
    mse = np.mean((y - y_hat)**2)
    # Guard against division by zero when edf approaches n
    complexity_penalty = (1 - edf / n)**2
    if complexity_penalty < 1e-10:
        return np.inf
    return mse / complexity_penalty


def select_bandwidth_gcv(
    x: np.ndarray,
    y: np.ndarray,
    h_candidates: np.ndarray,
    kernel=None,
    degree: int = 1
) -> Tuple[float, np.ndarray, np.ndarray]:
    """
    Select bandwidth by minimizing the GCV score.

    Returns:
        h_opt: Optimal bandwidth
        gcv_scores: GCV scores for all candidates
        edf_values: Effective degrees of freedom for all candidates
    """
    if kernel is None:
        kernel = lambda u: 0.75 * (1 - u**2) * (np.abs(u) < 1)
    x = np.asarray(x).flatten()
    y = np.asarray(y).flatten()
    n = len(x)
    gcv_scores = []
    edf_values = []
    for h in h_candidates:
        # Compute smoother matrix and predictions
        S = np.zeros((n, n))
        for j in range(n):
            x0 = x[j]
            u = (x - x0) / h
            w = kernel(u)
            X_local = np.column_stack([(x - x0)**p for p in range(degree + 1)])
            W = np.diag(w)
            XtWX = X_local.T @ W @ X_local
            try:
                XtWX_inv = np.linalg.inv(XtWX + 1e-10 * np.eye(degree + 1))
                e1 = np.zeros(degree + 1)
                e1[0] = 1
                S[j, :] = e1 @ XtWX_inv @ X_local.T @ W
            except np.linalg.LinAlgError:
                S[j, :] = w / (np.sum(w) + 1e-10)
        y_hat = S @ y
        edf = compute_edf(S)
        gcv_scores.append(gcv_score(y, y_hat, edf))
        edf_values.append(edf)
    gcv_scores = np.array(gcv_scores)
    edf_values = np.array(edf_values)
    h_opt = h_candidates[np.argmin(gcv_scores)]
    return h_opt, gcv_scores, edf_values


# =============================================================================
# Demonstration: Compare CV and GCV
# =============================================================================
np.random.seed(42)

n = 100
x = np.sort(np.random.uniform(0, 2*np.pi, n))
y = np.sin(x) + np.random.normal(0, 0.3, n)

h_range = np.linspace(0.15, 1.5, 25)

# GCV selection
h_gcv, gcv_scores, edf_values = select_bandwidth_gcv(x, y, h_range, degree=1)

print("GCV Bandwidth Selection")
print("=" * 60)
print(f"Optimal bandwidth (GCV): h = {h_gcv:.3f}")
print()
print("Bandwidth vs. Effective Degrees of Freedom:")
print("-" * 60)
print(f"{'h':>8} | {'EDF':>8} | {'GCV':>12} | {'Complexity'}")
print("-" * 60)
for h, edf, gcv in list(zip(h_range, edf_values, gcv_scores))[::5]:
    complexity = "High" if edf > 30 else ("Medium" if edf > 15 else "Low")
    print(f"{h:>8.3f} | {edf:>8.2f} | {gcv:>12.6f} | {complexity}")
```

The Plug-in Idea:
Recall the optimal bandwidth formula:
$$h_{\text{opt}} = \left( \frac{\sigma^2 R(K)}{\mu_2^2 \int [f''(x)]^2 dx} \right)^{1/5} n^{-1/5}$$
The kernel constants $R(K)$ and $\mu_2$ are known. We need to estimate two unknowns: the noise variance $\sigma^2$ and the roughness $\int [f''(x)]^2 dx$.
Plug-in approach: Estimate these quantities and substitute into the formula.
Step 1: Estimate $\sigma^2$
Use residuals from a pilot fit (e.g., with a larger bandwidth): $$\hat{\sigma}^2 = \frac{1}{n - \text{EDF}} \sum_{i=1}^{n} (y_i - \hat{f}_{\text{pilot}}(x_i))^2$$
Step 2: Estimate $\int [f''(x)]^2 dx$
This is the tricky part. We must estimate the second derivative of a function we don't know!
Approach 1: Parametric pilot — fit a simple parametric model (e.g., a polynomial), differentiate it analytically, and compute the roughness of the fitted curve.
Approach 2: Nonparametric pilot — estimate $f''$ with a kernel smoother using a pilot bandwidth, which itself must be chosen (often by a rule of thumb).
Approach 3: Two-stage plug-in (Sheather-Jones style) — iterate between estimating the roughness and updating the bandwidth. In R, the normal reference rule is available as `bw.nrd()` and the Sheather-Jones selector as `bw.SJ()`.
```python
import numpy as np
from scipy import integrate


def plugin_bandwidth_normal(x: np.ndarray, kernel: str = 'epanechnikov') -> float:
    """
    Rule-of-thumb plug-in bandwidth assuming an underlying Gaussian
    distribution (the 'normal reference rule', or Silverman's rule).
    Optimal for Gaussian data with a Gaussian kernel, but a reasonable
    starting point for other smooth unimodal distributions.
    """
    x = np.asarray(x).flatten()
    n = len(x)
    # Robust scale estimate: min of std and IQR/1.349
    sigma = min(np.std(x),
                (np.percentile(x, 75) - np.percentile(x, 25)) / 1.349)
    # Kernel-specific constants (Gaussian: h_opt = 1.06 * sigma * n^{-1/5})
    kernel_constants = {
        'gaussian': 1.06,
        'epanechnikov': 2.34,
        'quartic': 2.78,
    }
    C = kernel_constants.get(kernel, 1.06)
    # Note: derived for density estimation; for regression the n^{-1/5}
    # rate is the same but the constants differ
    return C * sigma * n**(-1/5)


def plugin_bandwidth_regression(
    x: np.ndarray, y: np.ndarray, pilot_degree: int = 4
) -> float:
    """
    Plug-in bandwidth selector for regression using a polynomial pilot:
    1. Fit a high-degree polynomial to estimate f''
    2. Compute the roughness integral
    3. Estimate the noise variance
    4. Plug into the optimal bandwidth formula
    """
    x = np.asarray(x).flatten()
    y = np.asarray(y).flatten()
    n = len(x)

    # Standardize x to [0, 1] for numerical stability
    x_min, x_max = x.min(), x.max()
    x_range = x_max - x_min
    x_std = (x - x_min) / x_range

    # Step 1: Fit polynomial pilot
    coeffs = np.polyfit(x_std, y, pilot_degree)
    poly = np.poly1d(coeffs)

    # Step 2: Second derivative of the pilot
    poly_d2 = poly.deriv(2)

    # Step 3: Roughness integral over the standardized domain [0, 1]
    roughness, _ = integrate.quad(lambda t: poly_d2(t)**2, 0, 1)

    # Step 4: Estimate sigma^2 from pilot residuals
    residuals = y - poly(x_std)
    sigma2 = np.var(residuals, ddof=pilot_degree + 1)

    # Step 5: Plug-in formula (Epanechnikov kernel: R(K) = 3/5, mu_2 = 1/5)
    R_K = 0.6
    mu_2 = 0.2
    if roughness < 1e-10:
        # Nearly linear function -- fall back to the rule of thumb
        h = plugin_bandwidth_normal(x, 'epanechnikov')
    else:
        # Apply the formula on the standardized domain, then scale back
        h_std = (sigma2 * R_K / (mu_2**2 * roughness))**(1/5) * n**(-1/5)
        h = h_std * x_range

    return h


# =============================================================================
# Demonstration
# =============================================================================
np.random.seed(42)

# Generate data with known curvature
n = 150
x = np.sort(np.random.uniform(0, 2*np.pi, n))
f_true = lambda x: np.sin(x) + 0.3 * np.sin(3*x)
y = f_true(x) + np.random.normal(0, 0.25, n)

h_plugin = plugin_bandwidth_regression(x, y, pilot_degree=6)
h_rot = plugin_bandwidth_normal(x, 'epanechnikov')

print("Plug-in Bandwidth Selection")
print("=" * 50)
print(f"Plugin (polynomial pilot, d=6):   h = {h_plugin:.4f}")
print(f"Rule-of-thumb (normal reference): h = {h_rot:.4f}")
print()
print("Note: the plugin bandwidth adapts to the function's curvature,")
print("      while the rule of thumb assumes Gaussian-like smoothness.")
```

In practice, use established implementations: statsmodels.nonparametric.bandwidths (Python), bw.nrd() and bw.SJ() (R), or scikit-learn's KernelDensity with cross-validated bandwidth (for density estimation). These handle edge cases and have been extensively validated.
When Quick Estimates Suffice:
For exploratory analysis or when computational resources are limited, rule-of-thumb bandwidth estimators provide reasonable starting points without iteration or cross-validation.
Silverman's Rule of Thumb (Normal Reference):
Assuming the true function has curvature similar to a Gaussian density:
$$h_{\text{rot}} = \left( \frac{4 \hat{\sigma}^5}{3n} \right)^{1/5} \approx 1.06 \cdot \hat{\sigma} \cdot n^{-1/5}$$
where $\hat{\sigma}$ is a robust scale estimate: $$\hat{\sigma} = \min\left( s, \frac{\text{IQR}}{1.349} \right)$$
The IQR adjustment prevents inflation by outliers.
| Kernel | h_rot formula | Constant C |
|---|---|---|
| Gaussian | $C \cdot \hat{\sigma} \cdot n^{-1/5}$ | 1.06 |
| Epanechnikov | $C \cdot \hat{\sigma} \cdot n^{-1/5}$ | 2.34 |
| Quartic (Biweight) | $C \cdot \hat{\sigma} \cdot n^{-1/5}$ | 2.78 |
| Triangular | $C \cdot \hat{\sigma} \cdot n^{-1/5}$ | 2.58 |
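The table's constants translate directly into code; the sketch below (with a simulated sample, using the robust scale estimate defined above) computes $h_{\text{rot}}$ under each kernel:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(0, 1.5, 200)
n = len(x)

# Robust scale: min of sample std and IQR/1.349
iqr = np.percentile(x, 75) - np.percentile(x, 25)
sigma_hat = min(np.std(x, ddof=1), iqr / 1.349)

# Kernel constants from the table above
constants = {'Gaussian': 1.06, 'Epanechnikov': 2.34,
             'Quartic (Biweight)': 2.78, 'Triangular': 2.58}
for name, C in constants.items():
    h = C * sigma_hat * n**(-1/5)
    print(f"{name:<20} h_rot = {h:.3f}")
```

Note that all four bandwidths share the same $\hat{\sigma} \cdot n^{-1/5}$ factor; only the kernel constant changes.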
Span-Based Rules for LOESS:
For LOESS with nearest-neighbor bandwidth, the span $\alpha$ (fraction of data) is the hyperparameter. Rules of thumb:
$$\alpha_{\text{default}} = 0.75 \quad \text{(Cleveland's original default)}$$
$$\alpha_{\text{adaptive}} = \max\left( 0.1, \min\left( 0.9, \frac{c \cdot (p+1)}{n} \right) \right)$$
where $p$ is the polynomial degree and $c \approx 3\text{-}5$ ensures enough points for stable local fits.
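A small helper (hypothetical; it takes $c = 4$ as an assumed midpoint of the suggested range) shows how this adaptive lower bound behaves as $n$ grows:

```python
def adaptive_span(n: int, p: int = 1, c: float = 4.0) -> float:
    """Adaptive LOESS span: clipped fraction ensuring >= c*(p+1) local points."""
    return max(0.1, min(0.9, c * (p + 1) / n))

for n in (20, 50, 200, 2000):
    print(f"n = {n:>5}: span = {adaptive_span(n):.3f}")
```

For small samples the span stays large so each local fit sees enough points; for large $n$ it bottoms out at the 0.1 floor.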
Scott's Rule (for density estimation, adaptable to regression): $$h = 1.059 \cdot \hat{\sigma} \cdot n^{-1/5}$$
Terrell's Maximal Smoothing: $$h = 3 \cdot \hat{\sigma} \cdot n^{-1/5}$$
Provides an upper bound for bandwidth—more smoothing rarely helps beyond this.
Rule-of-thumb methods assume Gaussian-like smoothness. They can substantially oversmooth for highly curved or wiggly functions, and undersmooth for very smooth (nearly linear) functions. Use them for initial exploration, then refine with CV or visual inspection.
Head-to-Head Comparison:
Each bandwidth selection method has strengths and weaknesses. Understanding these helps you choose appropriately for your application.
| Method | Computation | When It Works Well | When It Struggles |
|---|---|---|---|
| LOOCV | O(n³) naïve, O(n²) with S | General purpose; automatic | Highly variable for small n; computationally expensive |
| GCV | O(n²) for S, O(n) otherwise | Large n; quick approximation to CV | Tends to undersmooth; needs smoother matrix |
| k-Fold CV | ~O(n²) for fixed k | Balance of speed and stability | Still variable; k is another hyperparameter |
| Plug-in | O(n) after pilot fit | Smooth, unimodal functions | Pilot model misspecification; bootstrap needed for SE |
| Rule-of-thumb | O(n) for scale estimate | Quick initial estimate; EDA | Ignores function-specific curvature; wrong for complex functions |
```python
import numpy as np
from scipy.integrate import quad


def compare_bandwidth_methods(x: np.ndarray, y: np.ndarray) -> dict:
    """
    Compare different bandwidth selection methods on the same data.
    """
    n = len(x)
    x_range = x.max() - x.min()
    results = {}

    # 1. Rule of thumb (normal reference, Epanechnikov constant)
    sigma = min(np.std(x),
                (np.percentile(x, 75) - np.percentile(x, 25)) / 1.349)
    h_rot = 2.34 * sigma * n**(-0.2)
    results['Rule-of-thumb'] = h_rot

    # 2. Simple plugin with polynomial pilot
    x_std = (x - x.min()) / x_range
    coeffs = np.polyfit(x_std, y, 4)
    poly = np.poly1d(coeffs)
    poly_d2 = poly.deriv(2)
    roughness, _ = quad(lambda t: poly_d2(t)**2, 0, 1)
    residuals = y - poly(x_std)
    sigma2 = np.var(residuals, ddof=5)
    if roughness > 1e-10:
        # Epanechnikov: R(K) = 0.6, mu_2^2 = 0.04; formula applied on the
        # standardized domain, then scaled back to the original x scale
        h_plugin = (sigma2 * 0.6 / (0.04 * roughness))**0.2 * n**(-0.2)
        h_plugin *= x_range
    else:
        h_plugin = h_rot
    results['Plug-in (poly pilot)'] = h_plugin

    # 3. Cross-validation (simplified grid search, NW smoother for speed)
    h_range = np.linspace(0.1 * x_range, 0.5 * x_range, 20)
    best_cv = np.inf
    h_cv = h_range[0]
    for h in h_range:
        y_loo = np.zeros(n)
        for j in range(n):
            u = (x - x[j]) / h
            w = 0.75 * (1 - u**2) * (np.abs(u) < 1)
            w[j] = 0  # leave out point j
            if np.sum(w) > 0:
                y_loo[j] = np.sum(w * y) / np.sum(w)
            else:
                y_loo[j] = np.mean(y)
        cv = np.mean((y - y_loo)**2)
        if cv < best_cv:
            best_cv = cv
            h_cv = h
    results['LOOCV'] = h_cv

    return results


# =============================================================================
# Run comparison on multiple scenarios
# =============================================================================
np.random.seed(42)

scenarios = [
    ("Sinusoidal", lambda x: np.sin(2*x), 0.3),
    ("Polynomial", lambda x: 0.5*x**2 - 2*x + 1, 0.3),
    ("Wiggly", lambda x: np.sin(x) + 0.5*np.sin(5*x), 0.3),
]

print("Bandwidth Selection Method Comparison")
print("=" * 70)
for name, f_true, noise in scenarios:
    n = 100
    x = np.sort(np.random.uniform(0, 2*np.pi, n))
    y = f_true(x) + np.random.normal(0, noise, n)
    results = compare_bandwidth_methods(x, y)
    print(f"Scenario: {name}")
    print("-" * 40)
    for method, h in results.items():
        print(f"  {method:<25}: h = {h:.4f}")
```

Practical Recommendations:
For exploratory analysis: start from a rule-of-thumb bandwidth (or LOESS's default span) and adjust by visual inspection.
For production/publication: select by LOOCV or GCV over a grid of candidates, and check that the fit is stable for bandwidths near the minimum.
For very large n (n > 10,000): prefer GCV, plug-in, or rule-of-thumb estimates; even the O(n²) smoother-matrix shortcut for LOOCV becomes expensive.
When in doubt: compute several estimates (rule-of-thumb, plug-in, CV); if they roughly agree, the choice matters little, and if they disagree sharply, inspect the fits visually before committing.
A Production-Ready Bandwidth Selector:
Combining the methods discussed, here's a robust implementation that handles edge cases and provides multiple estimates:
```python
import numpy as np
from typing import Dict, Optional
from dataclasses import dataclass
from scipy.integrate import quad


@dataclass
class BandwidthResult:
    """Container for bandwidth selection results."""
    h_optimal: float
    method: str
    cv_score: Optional[float] = None
    edf: Optional[float] = None
    h_alternatives: Optional[Dict[str, float]] = None


def select_bandwidth(
    x: np.ndarray,
    y: np.ndarray,
    method: str = 'auto',
    kernel: str = 'epanechnikov',
    degree: int = 1,
    n_candidates: int = 25
) -> BandwidthResult:
    """
    Select bandwidth for local polynomial regression.

    Parameters:
        x: Input values (n,)
        y: Target values (n,)
        method: 'auto', 'cv', 'plugin', or 'rot'
        kernel: 'epanechnikov', 'gaussian', or 'quartic'
        degree: Local polynomial degree (1 recommended)
        n_candidates: Number of bandwidth values to search

    Returns:
        BandwidthResult with optimal bandwidth and diagnostics
    """
    x = np.asarray(x).flatten()
    y = np.asarray(y).flatten()
    n = len(x)

    # Validate inputs
    if n < 10:
        raise ValueError("Need at least 10 data points")
    x_range = x.max() - x.min()
    if x_range < 1e-10:
        raise ValueError("x values are constant or nearly constant")

    # Kernel functions
    kernels = {
        'epanechnikov': lambda u: 0.75 * (1 - u**2) * (np.abs(u) < 1),
        'gaussian': lambda u: np.exp(-u**2 / 2) / np.sqrt(2 * np.pi),
        'quartic': lambda u: (15/16) * (1 - u**2)**2 * (np.abs(u) < 1),
    }
    K = kernels.get(kernel, kernels['epanechnikov'])

    # Compute all alternatives
    alternatives = {}

    # Rule of thumb
    sigma = min(np.std(x),
                (np.percentile(x, 75) - np.percentile(x, 25)) / 1.349)
    if sigma < 1e-10:
        sigma = np.std(x) + 1e-10
    rot_constants = {'epanechnikov': 2.34, 'gaussian': 1.06, 'quartic': 2.78}
    h_rot = rot_constants.get(kernel, 2.34) * sigma * n**(-0.2)
    alternatives['rot'] = h_rot

    # Search range based on the rule of thumb
    h_min, h_max = 0.1 * h_rot, 3.0 * h_rot
    h_candidates = np.linspace(h_min, h_max, n_candidates)

    # Leave-one-out cross-validation
    def compute_cv_score(h):
        cv_error = 0.0
        for j in range(n):
            u = (x - x[j]) / h
            w = K(u)
            w[j] = 0  # leave one out
            if np.sum(w) > 1e-10:
                if degree == 0:
                    y_pred = np.sum(w * y) / np.sum(w)
                else:
                    X_loc = np.column_stack(
                        [(x - x[j])**p for p in range(degree + 1)])
                    W = np.diag(w)
                    try:
                        XtWX = X_loc.T @ W @ X_loc
                        XtWy = X_loc.T @ W @ y
                        beta = np.linalg.solve(
                            XtWX + 1e-8 * np.eye(degree + 1), XtWy)
                        y_pred = beta[0]
                    except np.linalg.LinAlgError:
                        y_pred = np.sum(w * y) / np.sum(w)
            else:
                y_pred = np.mean(y)
            cv_error += (y[j] - y_pred)**2
        return cv_error / n

    cv_scores = np.array([compute_cv_score(h) for h in h_candidates])
    h_cv = h_candidates[np.argmin(cv_scores)]
    alternatives['cv'] = h_cv

    # Plugin (simplified polynomial pilot)
    x_std = (x - x.min()) / x_range
    coeffs = np.polyfit(x_std, y, min(4, n // 10))
    poly = np.poly1d(coeffs)
    poly_d2 = poly.deriv(2)
    try:
        roughness, _ = quad(lambda t: poly_d2(t)**2, 0, 1, limit=100)
        roughness = max(roughness, 1e-10)
        sigma2 = np.var(y - poly(x_std), ddof=len(coeffs))
        # Epanechnikov constants; formula on standardized domain, scaled back
        h_plugin = (sigma2 * 0.6 / (0.04 * roughness))**0.2 * n**(-0.2) * x_range
        alternatives['plugin'] = np.clip(h_plugin, h_min, h_max)
    except Exception:
        alternatives['plugin'] = h_rot

    # Select based on method
    if method == 'auto':
        # CV for moderate n; the faster rule of thumb for very large n
        method = 'cv' if n < 1000 else 'rot'

    h_optimal = alternatives.get(method, h_cv)

    return BandwidthResult(
        h_optimal=h_optimal,
        method=method,
        cv_score=cv_scores.min() if method == 'cv' else None,
        edf=None,  # would require the smoother matrix
        h_alternatives=alternatives
    )


# =============================================================================
# Example usage
# =============================================================================
np.random.seed(42)

n = 200
x = np.sort(np.random.uniform(0, 10, n))
y = np.sin(x) + 0.3 * np.cos(3*x) + np.random.normal(0, 0.3, n)

result = select_bandwidth(x, y, method='auto', degree=1)

print("Bandwidth Selection Results")
print("=" * 50)
print(f"Optimal bandwidth: h = {result.h_optimal:.4f}")
print(f"Selection method:  {result.method}")
if result.cv_score is not None:
    print(f"CV score: {result.cv_score:.6f}")
print("Alternative estimates:")
for method, h in result.h_alternatives.items():
    print(f"  {method}: {h:.4f}")
```

We've developed a comprehensive toolkit for bandwidth selection: cross-validation, GCV, plug-in, and rule-of-thumb methods, each with its own accuracy-cost tradeoff.
What's Next:
We've mastered one-dimensional nonparametric regression. But what happens in higher dimensions? The next page confronts the curse of dimensionality—the fundamental challenge that makes nonparametric methods increasingly difficult as the number of predictors grows.
You now have practical tools for bandwidth selection: cross-validation for accuracy, GCV for efficiency, plug-in for speed, and rule-of-thumb for quick exploration. Apply these judiciously based on your data size and computational constraints. Next, we confront the curse of dimensionality.