In nonparametric regression, the bandwidth (also called smoothing parameter, window width, or span) is the single most important hyperparameter. It controls the bias-variance tradeoff that defines the estimator's behavior: a small bandwidth yields a flexible, wiggly fit with low bias but high variance, while a large bandwidth oversmooths, trading variance for bias.
Unlike parametric models where the model itself is the main choice, nonparametric regression essentially has one degree of freedom—the bandwidth. Getting it right is everything.
This page develops principled methods for bandwidth selection, from theoretical optimal formulas to practical data-driven procedures. We'll see that while no method is perfect, understanding the landscape of options enables informed choices for any application.
By the end of this page, you will understand: (1) The theoretical optimal bandwidth and why it's impractical directly; (2) Leave-one-out and k-fold cross-validation for bandwidth selection; (3) Generalized cross-validation (GCV) as a computationally efficient alternative; (4) Plug-in bandwidth selectors; (5) Rule-of-thumb estimators; (6) Practical recommendations for different scenarios.
Mean Integrated Squared Error:
The standard criterion for bandwidth selection is the Mean Integrated Squared Error (MISE):
$$\text{MISE}(h) = E \left[ \int \left( \hat{f}(x; h) - f(x) \right)^2 dx \right]$$
This measures the expected average squared deviation of the estimator from the true function across the entire domain.
Asymptotic MISE for Nadaraya-Watson:
For the NW estimator with kernel $K$ having second moment $\mu_2$ and roughness $R(K)$:
$$\text{AMISE}(h) = \frac{h^4 \mu_2^2}{4} \int [f''(x)]^2 dx + \frac{\sigma^2 R(K)}{nh}$$
(Simplified, ignoring design effects for cleaner exposition.)
Optimal Bandwidth:
Minimizing AMISE with respect to $h$ (taking derivative, setting to zero):
$$\frac{\partial \text{AMISE}}{\partial h} = h^3 \mu_2^2 \int [f''(x)]^2 dx - \frac{\sigma^2 R(K)}{nh^2} = 0$$
Solving for $h$:
$$h_{\text{opt}} = \left( \frac{\sigma^2 R(K)}{\mu_2^2 \int [f''(x)]^2 dx} \right)^{1/5} n^{-1/5}$$
Key Insights:
The catch: We don't know $f''(x)$ or $\sigma^2$! These must be estimated, leading to plug-in methods.
The term $\int [f''(x)]^2 dx$ is called the roughness of $f$. It measures how much the function curves. Straight lines have zero roughness; wildly oscillating functions have high roughness. This roughness appears in optimal bandwidth formulas throughout nonparametric statistics.
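To make the formula concrete, here is a minimal numeric sketch (assuming $f(x) = \sin x$ on $[0, 2\pi]$, so $\int [f''(x)]^2\,dx = \int_0^{2\pi} \sin^2 x\,dx = \pi$, an assumed noise level $\sigma = 0.3$, and Epanechnikov constants $R(K) = 3/5$, $\mu_2 = 1/5$) that evaluates $h_{\text{opt}}$ and checks it against a grid minimization of the AMISE expression:

```python
import numpy as np

# Known quantities for f(x) = sin(x) on [0, 2*pi]:
# f''(x) = -sin(x), so the roughness integral is pi
roughness = np.pi
sigma2 = 0.3**2          # assumed noise variance
R_K, mu2 = 3/5, 1/5      # Epanechnikov kernel constants
n = 100

# Closed-form optimal bandwidth
h_opt = (sigma2 * R_K / (mu2**2 * roughness))**(1/5) * n**(-1/5)

# AMISE(h) = h^4 mu2^2/4 * roughness + sigma^2 R(K) / (n h)
amise = lambda h: h**4 * mu2**2 / 4 * roughness + sigma2 * R_K / (n * h)

# Verify: the closed form minimizes AMISE over a fine grid
grid = np.linspace(0.05, 2.0, 2000)
h_grid = grid[np.argmin(amise(grid))]

print(f"h_opt (closed form): {h_opt:.4f}")
print(f"h_opt (grid search): {h_grid:.4f}")
```

The two values agree to grid resolution, confirming the calculus above.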
Optimal MISE:
Substituting $h_{\text{opt}}$ back into AMISE:
$$\text{MISE}_{\text{opt}} = C(K) \cdot (\sigma^2)^{4/5} \left[ \int [f''(x)]^2 dx \right]^{1/5} n^{-4/5}$$
where $C(K)$ is a constant depending only on the kernel.
This confirms the $n^{-4/5}$ convergence rate for nonparametric regression—slower than parametric rates but achievable with proper bandwidth selection.
Leave-One-Out Cross-Validation (LOOCV):
The most straightforward data-driven approach. For each candidate bandwidth $h$:
(1) For each $i = 1, \ldots, n$, fit the estimator on all data except $(x_i, y_i)$ and compute the held-out prediction $\hat{f}_{-i}(x_i; h)$.
(2) Compute the cross-validation score: $$\text{CV}(h) = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{f}_{-i}(x_i; h) \right)^2$$
(3) Choose the $h$ minimizing $\text{CV}(h)$.
Why it works: CV(h) is an approximately unbiased estimate of the prediction error at new points. Minimizing it approximately minimizes MISE.
Efficient Computation via the Smoother Matrix:
Naïvely, LOOCV requires $n$ separate fits—expensive for large $n$. The key insight is that for linear smoothers (including kernel smoothers and local polynomial regression), we can express:
$$\hat{\mathbf{y}} = \mathbf{S} \mathbf{y}$$
where $\mathbf{S}$ is the smoother matrix (also called the hat matrix). Entry $S_{ij}$ gives the weight that observation $j$ contributes to the prediction at observation $i$.
The LOOCV shortcut:
$$\hat{f}_{-i}(x_i) = \frac{\hat{f}(x_i) - S_{ii} y_i}{1 - S_{ii}}$$
This allows computing all leave-one-out predictions from a single fit:
$$\text{CV}(h) = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{y_i - \hat{f}(x_i; h)}{1 - S_{ii}} \right)^2$$
Complexity: $O(n^2)$ to compute $\mathbf{S}$, then $O(n)$ for CV score—much better than $O(n^3)$ for naïve approach.
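The shortcut is easy to verify numerically. This sketch (assuming a Nadaraya-Watson smoother with a Gaussian kernel, so each row of $\mathbf{S}$ holds normalized kernel weights) compares brute-force leave-one-out predictions against the single-fit formula:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 40
x = np.sort(rng.uniform(0, 5, n))
y = np.sin(x) + rng.normal(0, 0.2, n)
h = 0.5

# Nadaraya-Watson smoother matrix: rows of normalized Gaussian weights
K = np.exp(-0.5 * ((x[:, None] - x[None, :]) / h)**2)
S = K / K.sum(axis=1, keepdims=True)
y_hat = S @ y

# Brute-force LOOCV: refit with point i removed
loo_brute = np.empty(n)
for i in range(n):
    w = K[i].copy()
    w[i] = 0.0
    loo_brute[i] = w @ y / w.sum()

# Single-fit shortcut: f_{-i}(x_i) = (f(x_i) - S_ii y_i) / (1 - S_ii)
d = np.diag(S)
loo_shortcut = (y_hat - d * y) / (1 - d)

print(np.allclose(loo_brute, loo_shortcut))  # True
```

The identity is exact here because each row of $\mathbf{S}$ sums to one, so deleting point $i$ and renormalizing the remaining weights is algebraically equivalent to the shortcut formula.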
```python
import numpy as np
from typing import Tuple, Callable


def compute_smoother_matrix(
    x: np.ndarray,
    h: float,
    kernel: Callable = None,
    degree: int = 1
) -> np.ndarray:
    """
    Compute the smoother matrix S such that y_hat = S @ y,
    for local polynomial regression of the given degree.
    """
    if kernel is None:
        kernel = lambda u: 0.75 * (1 - u**2) * (np.abs(u) < 1)  # Epanechnikov
    x = np.asarray(x).flatten()
    n = len(x)
    S = np.zeros((n, n))
    for j in range(n):
        x0 = x[j]
        # Kernel weights
        u = (x - x0) / h
        w = kernel(u)
        # Local design matrix centered at x0
        X_local = np.column_stack([(x - x0)**p for p in range(degree + 1)])
        # Weighted least squares: row j of S is e_1^T (X'WX)^{-1} X'W
        W = np.diag(w)
        XtWX = X_local.T @ W @ X_local
        try:
            XtWX_inv = np.linalg.inv(XtWX + 1e-10 * np.eye(degree + 1))
            e1 = np.zeros(degree + 1)
            e1[0] = 1
            S[j, :] = e1 @ XtWX_inv @ X_local.T @ W
        except np.linalg.LinAlgError:
            # Fall back to Nadaraya-Watson weights
            S[j, :] = w / (np.sum(w) + 1e-10)
    return S


def loocv_score(
    x: np.ndarray,
    y: np.ndarray,
    h: float,
    kernel: Callable = None,
    degree: int = 1
) -> float:
    """
    Leave-one-out cross-validation score for a given bandwidth,
    computed efficiently via the smoother matrix.
    """
    S = compute_smoother_matrix(x, h, kernel, degree)
    y_hat = S @ y
    # LOOCV shortcut: mean of ((y - y_hat) / (1 - S_ii))^2
    diagonal = np.diag(S)
    # Guard against division by zero when S_ii is close to 1
    diagonal = np.clip(diagonal, -np.inf, 0.999)
    residuals = y - y_hat
    loocv_residuals = residuals / (1 - diagonal)
    return np.mean(loocv_residuals**2)


def select_bandwidth_cv(
    x: np.ndarray,
    y: np.ndarray,
    h_candidates: np.ndarray,
    kernel: Callable = None,
    degree: int = 1
) -> Tuple[float, np.ndarray]:
    """
    Select bandwidth by minimizing the LOOCV score.

    Returns:
        h_opt: Optimal bandwidth
        cv_scores: CV scores for all candidates
    """
    cv_scores = np.array([
        loocv_score(x, y, h, kernel, degree) for h in h_candidates
    ])
    h_opt = h_candidates[np.argmin(cv_scores)]
    return h_opt, cv_scores


# =============================================================================
# Demonstration
# =============================================================================
np.random.seed(42)

# Generate data
n = 100
x = np.sort(np.random.uniform(0, 2*np.pi, n))
f_true = lambda x: np.sin(x)
y = f_true(x) + np.random.normal(0, 0.3, n)

# Candidate bandwidths
h_range = np.linspace(0.1, 2.0, 30)

# Run CV
h_opt, cv_scores = select_bandwidth_cv(x, y, h_range, degree=1)

print("Bandwidth Selection via LOOCV")
print("=" * 50)
print(f"Optimal bandwidth: h = {h_opt:.3f}")
print(f"Minimum CV score: {cv_scores.min():.6f}")
print("Top 5 bandwidths by CV score:")
sorted_idx = np.argsort(cv_scores)[:5]
for i, idx in enumerate(sorted_idx):
    print(f"  {i+1}. h = {h_range[idx]:.3f}, CV = {cv_scores[idx]:.6f}")
```

Cross-validation can be variable, especially for small samples. The CV curve may have multiple local minima or a flat minimum region. For more stable selection: (1) average over multiple CV splits; (2) use k-fold CV instead of LOOCV; (3) smooth the CV curve before finding the minimum; (4) consider the 'one standard error rule' (choose the largest $h$ within one SE of the minimum).
From LOOCV to GCV:
LOOCV requires computing (or accessing) the diagonal elements $S_{ii}$ of the smoother matrix. For very large $n$, even storing $\mathbf{S}$ may be prohibitive.
Generalized Cross-Validation (Craven & Wahba, 1979) approximates LOOCV by replacing each $S_{ii}$ with their average:
$$\text{GCV}(h) = \frac{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{f}(x_i))^2}{\left( 1 - \frac{\text{tr}(\mathbf{S})}{n} \right)^2}$$
where $\text{tr}(\mathbf{S}) = \sum_{i=1}^{n} S_{ii}$ is the effective degrees of freedom (EDF) of the smoother.
Interpretation:
GCV inflates training error based on model complexity, approximating true prediction error.
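A two-line toy computation (with assumed numbers, not taken from the text) shows the size of this inflation:

```python
# Assumed: training MSE of 0.08, EDF of 10, on n = 100 points
n, mse, edf = 100, 0.08, 10.0

# GCV inflates the training error by 1 / (1 - edf/n)^2
gcv = mse / (1 - edf / n)**2
print(f"GCV = {gcv:.4f}")  # 0.0988, about 23% above the raw MSE
```

The closer EDF gets to $n$, the more aggressively GCV inflates the training error.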
Effective Degrees of Freedom:
For linear smoothers, the effective degrees of freedom
$$\text{EDF}(h) = \text{tr}(\mathbf{S})$$
measures model complexity: small bandwidths give high EDF (flexible, wiggly fits), while large bandwidths give low EDF (smooth, nearly parametric fits).
For OLS regression with $p$ predictors, $\text{tr}(\mathbf{H}) = p$ (exactly). For nonparametric smoothers, EDF is continuous in $h$, providing a smooth complexity measure.
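The OLS fact is easy to verify numerically; a minimal sketch with a random design (an intercept plus $p = 4$ predictors, so $\text{tr}(\mathbf{H}) = 5$):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 4

# Design matrix with intercept: p + 1 columns in total
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])

# OLS hat matrix H = X (X'X)^{-1} X'
H = X @ np.linalg.solve(X.T @ X, X.T)

# The trace of a projection matrix equals its rank: the column count
print(round(np.trace(H), 6))  # 5.0
```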
Properties of GCV: it requires only the trace $\text{tr}(\mathbf{S})$ rather than the individual diagonal entries; it closely tracks LOOCV when the leverages $S_{ii}$ are roughly equal; and it tends to undersmooth slightly in small samples.
```python
import numpy as np
from typing import Tuple


def compute_edf(S: np.ndarray) -> float:
    """Effective degrees of freedom = trace of smoother matrix."""
    return np.trace(S)


def gcv_score(y: np.ndarray, y_hat: np.ndarray, edf: float) -> float:
    """
    Generalized Cross-Validation score: GCV = MSE / (1 - edf/n)^2.
    """
    n = len(y)
    mse = np.mean((y - y_hat)**2)
    # Guard against division by zero when edf approaches n
    complexity_penalty = (1 - edf / n)**2
    if complexity_penalty < 1e-10:
        return np.inf
    return mse / complexity_penalty


def select_bandwidth_gcv(
    x: np.ndarray,
    y: np.ndarray,
    h_candidates: np.ndarray,
    kernel=None,
    degree: int = 1
) -> Tuple[float, np.ndarray, np.ndarray]:
    """
    Select bandwidth by minimizing the GCV score.

    Returns:
        h_opt: Optimal bandwidth
        gcv_scores: GCV scores for all candidates
        edf_values: Effective degrees of freedom for all candidates
    """
    if kernel is None:
        kernel = lambda u: 0.75 * (1 - u**2) * (np.abs(u) < 1)
    x = np.asarray(x).flatten()
    y = np.asarray(y).flatten()
    n = len(x)
    gcv_scores = []
    edf_values = []
    for h in h_candidates:
        # Compute smoother matrix and predictions
        S = np.zeros((n, n))
        for j in range(n):
            x0 = x[j]
            u = (x - x0) / h
            w = kernel(u)
            X_local = np.column_stack([(x - x0)**p for p in range(degree + 1)])
            W = np.diag(w)
            XtWX = X_local.T @ W @ X_local
            try:
                XtWX_inv = np.linalg.inv(XtWX + 1e-10 * np.eye(degree + 1))
                e1 = np.zeros(degree + 1)
                e1[0] = 1
                S[j, :] = e1 @ XtWX_inv @ X_local.T @ W
            except np.linalg.LinAlgError:
                S[j, :] = w / (np.sum(w) + 1e-10)
        y_hat = S @ y
        edf = compute_edf(S)
        gcv_scores.append(gcv_score(y, y_hat, edf))
        edf_values.append(edf)
    gcv_scores = np.array(gcv_scores)
    edf_values = np.array(edf_values)
    h_opt = h_candidates[np.argmin(gcv_scores)]
    return h_opt, gcv_scores, edf_values


# =============================================================================
# Demonstration: Compare CV and GCV
# =============================================================================
np.random.seed(42)

n = 100
x = np.sort(np.random.uniform(0, 2*np.pi, n))
y = np.sin(x) + np.random.normal(0, 0.3, n)

h_range = np.linspace(0.15, 1.5, 25)

# GCV selection
h_gcv, gcv_scores, edf_values = select_bandwidth_gcv(x, y, h_range, degree=1)

print("GCV Bandwidth Selection")
print("=" * 60)
print(f"Optimal bandwidth (GCV): h = {h_gcv:.3f}")
print()
print("Bandwidth vs. Effective Degrees of Freedom:")
print("-" * 60)
print(f"{'h':>8} | {'EDF':>8} | {'GCV':>12} | {'Complexity'}")
print("-" * 60)
for h, edf, gcv in list(zip(h_range, edf_values, gcv_scores))[::5]:
    complexity = "High" if edf > 30 else ("Medium" if edf > 15 else "Low")
    print(f"{h:>8.3f} | {edf:>8.2f} | {gcv:>12.6f} | {complexity}")
```

The Plug-in Idea:
Recall the optimal bandwidth formula:
$$h_{\text{opt}} = \left( \frac{\sigma^2 R(K)}{\mu_2^2 \int [f''(x)]^2 dx} \right)^{1/5} n^{-1/5}$$
The kernel constants $R(K)$ and $\mu_2$ are known. We need to estimate two unknowns: the noise variance $\sigma^2$ and the roughness $\int [f''(x)]^2 dx$.
Plug-in approach: Estimate these quantities and substitute into the formula.
Step 1: Estimate $\sigma^2$
Use residuals from a pilot fit (e.g., with a larger bandwidth): $$\hat{\sigma}^2 = \frac{1}{n - \text{EDF}} \sum_{i=1}^{n} (y_i - \hat{f}_{\text{pilot}}(x_i))^2$$
Step 2: Estimate $\int [f''(x)]^2 dx$
This is the tricky part. We must estimate the second derivative of a function we don't know!
Approach 1: Parametric pilot — fit a simple parametric model (e.g., a polynomial), differentiate it analytically, and compute the roughness of the fitted curve.
Approach 2: Nonparametric pilot — estimate $f''$ with a kernel smoother using a pilot bandwidth, which itself must be chosen (often by a rule of thumb).
Approach 3: Two-stage plug-in (Sheather-Jones style) — iterate between estimating the roughness and updating the bandwidth. In R, the normal reference rule is available as `bw.nrd()` and the Sheather-Jones selector as `bw.SJ()`.
```python
import numpy as np
from scipy import integrate


def plugin_bandwidth_normal(x: np.ndarray, kernel: str = 'epanechnikov') -> float:
    """
    Rule-of-thumb plug-in bandwidth assuming an underlying Gaussian
    distribution (the 'normal reference rule', or Silverman's rule).
    Optimal for Gaussian data with a Gaussian kernel, but a reasonable
    starting point for other smooth unimodal distributions.
    """
    x = np.asarray(x).flatten()
    n = len(x)
    # Robust scale estimate: min of std and IQR/1.349
    sigma = min(np.std(x),
                (np.percentile(x, 75) - np.percentile(x, 25)) / 1.349)
    # Kernel-specific constants (Gaussian: h_opt = 1.06 * sigma * n^{-1/5})
    kernel_constants = {
        'gaussian': 1.06,
        'epanechnikov': 2.34,
        'quartic': 2.78,
    }
    C = kernel_constants.get(kernel, 1.06)
    # Note: derived for density estimation; for regression the n^{-1/5}
    # rate is the same but the constants differ
    return C * sigma * n**(-1/5)


def plugin_bandwidth_regression(
    x: np.ndarray, y: np.ndarray, pilot_degree: int = 4
) -> float:
    """
    Plug-in bandwidth selector for regression using a polynomial pilot:
    1. Fit a high-degree polynomial to estimate f''
    2. Compute the roughness integral
    3. Estimate the noise variance
    4. Plug into the optimal bandwidth formula
    """
    x = np.asarray(x).flatten()
    y = np.asarray(y).flatten()
    n = len(x)

    # Standardize x to [0, 1] for numerical stability
    x_min, x_max = x.min(), x.max()
    x_range = x_max - x_min
    x_std = (x - x_min) / x_range

    # Step 1: Fit polynomial pilot
    coeffs = np.polyfit(x_std, y, pilot_degree)
    poly = np.poly1d(coeffs)

    # Step 2: Second derivative of the pilot
    poly_d2 = poly.deriv(2)

    # Step 3: Roughness integral over the standardized domain [0, 1]
    roughness, _ = integrate.quad(lambda t: poly_d2(t)**2, 0, 1)

    # Step 4: Estimate sigma^2 from pilot residuals
    residuals = y - poly(x_std)
    sigma2 = np.var(residuals, ddof=pilot_degree + 1)

    # Step 5: Plug-in formula (Epanechnikov kernel: R(K) = 3/5, mu_2 = 1/5)
    R_K = 0.6
    mu_2 = 0.2
    if roughness < 1e-10:
        # Nearly linear function -- fall back to the rule of thumb
        h = plugin_bandwidth_normal(x, 'epanechnikov')
    else:
        # Apply the formula on the standardized domain, then scale back
        h_std = (sigma2 * R_K / (mu_2**2 * roughness))**(1/5) * n**(-1/5)
        h = h_std * x_range

    return h


# =============================================================================
# Demonstration
# =============================================================================
np.random.seed(42)

# Generate data with known curvature
n = 150
x = np.sort(np.random.uniform(0, 2*np.pi, n))
f_true = lambda x: np.sin(x) + 0.3 * np.sin(3*x)
y = f_true(x) + np.random.normal(0, 0.25, n)

h_plugin = plugin_bandwidth_regression(x, y, pilot_degree=6)
h_rot = plugin_bandwidth_normal(x, 'epanechnikov')

print("Plug-in Bandwidth Selection")
print("=" * 50)
print(f"Plugin (polynomial pilot, d=6):   h = {h_plugin:.4f}")
print(f"Rule-of-thumb (normal reference): h = {h_rot:.4f}")
print()
print("Note: the plugin bandwidth adapts to the function's curvature,")
print("      while the rule of thumb assumes Gaussian-like smoothness.")
```

In practice, use established implementations: statsmodels.nonparametric.bandwidths (Python), bw.nrd() and bw.SJ() (R), or scikit-learn's KernelDensity with cross-validated bandwidth (for density estimation). These handle edge cases and have been extensively validated.
When Quick Estimates Suffice:
For exploratory analysis or when computational resources are limited, rule-of-thumb bandwidth estimators provide reasonable starting points without iteration or cross-validation.
Silverman's Rule of Thumb (Normal Reference):
Assuming the true function has curvature similar to a Gaussian density:
$$h_{\text{rot}} = \left( \frac{4 \hat{\sigma}^5}{3n} \right)^{1/5} \approx 1.06 \cdot \hat{\sigma} \cdot n^{-1/5}$$
where $\hat{\sigma}$ is a robust scale estimate: $$\hat{\sigma} = \min\left( s, \frac{\text{IQR}}{1.349} \right)$$
The IQR adjustment prevents inflation by outliers.
| Kernel | h_rot formula | Constant C |
|---|---|---|
| Gaussian | $C \cdot \hat{\sigma} \cdot n^{-1/5}$ | 1.06 |
| Epanechnikov | $C \cdot \hat{\sigma} \cdot n^{-1/5}$ | 2.34 |
| Quartic (Biweight) | $C \cdot \hat{\sigma} \cdot n^{-1/5}$ | 2.78 |
| Triangular | $C \cdot \hat{\sigma} \cdot n^{-1/5}$ | 2.58 |
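The table's constants translate directly into code; the sketch below (with a simulated sample, using the robust scale estimate defined above) computes $h_{\text{rot}}$ under each kernel:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(0, 1.5, 200)
n = len(x)

# Robust scale: min of sample std and IQR/1.349
iqr = np.percentile(x, 75) - np.percentile(x, 25)
sigma_hat = min(np.std(x, ddof=1), iqr / 1.349)

# Kernel constants from the table above
constants = {'Gaussian': 1.06, 'Epanechnikov': 2.34,
             'Quartic (Biweight)': 2.78, 'Triangular': 2.58}
for name, C in constants.items():
    h = C * sigma_hat * n**(-1/5)
    print(f"{name:<20} h_rot = {h:.3f}")
```

Note that all four bandwidths share the same $\hat{\sigma} \cdot n^{-1/5}$ factor; only the kernel constant changes.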
Span-Based Rules for LOESS:
For LOESS with nearest-neighbor bandwidth, the span $\alpha$ (fraction of data) is the hyperparameter. Rules of thumb:
$$\alpha_{\text{default}} = 0.75 \quad \text{(Cleveland's original default)}$$
$$\alpha_{\text{adaptive}} = \max\left( 0.1, \min\left( 0.9, \frac{c \cdot (p+1)}{n} \right) \right)$$
where $p$ is the polynomial degree and $c \approx 3\text{-}5$ ensures enough points for stable local fits.
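A small helper (hypothetical; it takes $c = 4$ as an assumed midpoint of the suggested range) shows how this adaptive lower bound behaves as $n$ grows:

```python
def adaptive_span(n: int, p: int = 1, c: float = 4.0) -> float:
    """Adaptive LOESS span: clipped fraction ensuring >= c*(p+1) local points."""
    return max(0.1, min(0.9, c * (p + 1) / n))

for n in (20, 50, 200, 2000):
    print(f"n = {n:>5}: span = {adaptive_span(n):.3f}")
```

For small samples the span stays large so each local fit sees enough points; for large $n$ it bottoms out at the 0.1 floor.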
Scott's Rule (for density estimation, adaptable to regression): $$h = 1.059 \cdot \hat{\sigma} \cdot n^{-1/5}$$
Terrell's Maximal Smoothing: $$h = 3 \cdot \hat{\sigma} \cdot n^{-1/5}$$
Provides an upper bound for bandwidth—more smoothing rarely helps beyond this.
Rule-of-thumb methods assume Gaussian-like smoothness. They can substantially oversmooth for highly curved or wiggly functions, and undersmooth for very smooth (nearly linear) functions. Use them for initial exploration, then refine with CV or visual inspection.
Head-to-Head Comparison:
Each bandwidth selection method has strengths and weaknesses. Understanding these helps you choose appropriately for your application.
| Method | Computation | When It Works Well | When It Struggles |
|---|---|---|---|
| LOOCV | O(n³) naïve, O(n²) with S | General purpose; automatic | Highly variable for small n; computationally expensive |
| GCV | O(n²) for S, O(n) otherwise | Large n; quick approximation to CV | Tends to undersmooth; needs smoother matrix |
| k-Fold CV | ~O(n²) for fixed k | Balance of speed and stability | Still variable; k is another hyperparameter |
| Plug-in | O(n) after pilot fit | Smooth, unimodal functions | Pilot model misspecification; bootstrap needed for SE |
| Rule-of-thumb | O(n) for scale estimate | Quick initial estimate; EDA | Ignores function-specific curvature; wrong for complex functions |
```python
import numpy as np
from scipy.integrate import quad


def compare_bandwidth_methods(x: np.ndarray, y: np.ndarray) -> dict:
    """
    Compare different bandwidth selection methods on the same data.
    """
    n = len(x)
    x_range = x.max() - x.min()
    results = {}

    # 1. Rule of thumb (normal reference, Epanechnikov constant)
    sigma = min(np.std(x),
                (np.percentile(x, 75) - np.percentile(x, 25)) / 1.349)
    h_rot = 2.34 * sigma * n**(-0.2)
    results['Rule-of-thumb'] = h_rot

    # 2. Simple plugin with polynomial pilot
    x_std = (x - x.min()) / x_range
    coeffs = np.polyfit(x_std, y, 4)
    poly = np.poly1d(coeffs)
    poly_d2 = poly.deriv(2)
    roughness, _ = quad(lambda t: poly_d2(t)**2, 0, 1)
    residuals = y - poly(x_std)
    sigma2 = np.var(residuals, ddof=5)
    if roughness > 1e-10:
        # Epanechnikov: R(K) = 0.6, mu_2^2 = 0.04; formula applied on the
        # standardized domain, then scaled back to the original x scale
        h_plugin = (sigma2 * 0.6 / (0.04 * roughness))**0.2 * n**(-0.2)
        h_plugin *= x_range
    else:
        h_plugin = h_rot
    results['Plug-in (poly pilot)'] = h_plugin

    # 3. Cross-validation (simplified grid search, NW smoother for speed)
    h_range = np.linspace(0.1 * x_range, 0.5 * x_range, 20)
    best_cv = np.inf
    h_cv = h_range[0]
    for h in h_range:
        y_loo = np.zeros(n)
        for j in range(n):
            u = (x - x[j]) / h
            w = 0.75 * (1 - u**2) * (np.abs(u) < 1)
            w[j] = 0  # leave out point j
            if np.sum(w) > 0:
                y_loo[j] = np.sum(w * y) / np.sum(w)
            else:
                y_loo[j] = np.mean(y)
        cv = np.mean((y - y_loo)**2)
        if cv < best_cv:
            best_cv = cv
            h_cv = h
    results['LOOCV'] = h_cv

    return results


# =============================================================================
# Run comparison on multiple scenarios
# =============================================================================
np.random.seed(42)

scenarios = [
    ("Sinusoidal", lambda x: np.sin(2*x), 0.3),
    ("Polynomial", lambda x: 0.5*x**2 - 2*x + 1, 0.3),
    ("Wiggly", lambda x: np.sin(x) + 0.5*np.sin(5*x), 0.3),
]

print("Bandwidth Selection Method Comparison")
print("=" * 70)
for name, f_true, noise in scenarios:
    n = 100
    x = np.sort(np.random.uniform(0, 2*np.pi, n))
    y = f_true(x) + np.random.normal(0, noise, n)
    results = compare_bandwidth_methods(x, y)
    print(f"Scenario: {name}")
    print("-" * 40)
    for method, h in results.items():
        print(f"  {method:<25}: h = {h:.4f}")
```

Practical Recommendations:
For exploratory analysis: start from a rule-of-thumb bandwidth (or LOESS's default span) and adjust by visual inspection.
For production/publication: select by LOOCV or GCV over a grid of candidates, and check that the fit is stable for bandwidths near the minimum.
For very large n (n > 10,000): prefer GCV, plug-in, or rule-of-thumb estimates; even the O(n²) smoother-matrix shortcut for LOOCV becomes expensive.
When in doubt: compute several estimates (rule-of-thumb, plug-in, CV); if they roughly agree, the choice matters little, and if they disagree sharply, inspect the fits visually before committing.
A Production-Ready Bandwidth Selector:
Combining the methods discussed, here's a robust implementation that handles edge cases and provides multiple estimates:
```python
import numpy as np
from typing import Dict, Optional
from dataclasses import dataclass
from scipy.integrate import quad


@dataclass
class BandwidthResult:
    """Container for bandwidth selection results."""
    h_optimal: float
    method: str
    cv_score: Optional[float] = None
    edf: Optional[float] = None
    h_alternatives: Optional[Dict[str, float]] = None


def select_bandwidth(
    x: np.ndarray,
    y: np.ndarray,
    method: str = 'auto',
    kernel: str = 'epanechnikov',
    degree: int = 1,
    n_candidates: int = 25
) -> BandwidthResult:
    """
    Select bandwidth for local polynomial regression.

    Parameters:
        x: Input values (n,)
        y: Target values (n,)
        method: 'auto', 'cv', 'plugin', or 'rot'
        kernel: 'epanechnikov', 'gaussian', or 'quartic'
        degree: Local polynomial degree (1 recommended)
        n_candidates: Number of bandwidth values to search

    Returns:
        BandwidthResult with optimal bandwidth and diagnostics
    """
    x = np.asarray(x).flatten()
    y = np.asarray(y).flatten()
    n = len(x)

    # Validate inputs
    if n < 10:
        raise ValueError("Need at least 10 data points")
    x_range = x.max() - x.min()
    if x_range < 1e-10:
        raise ValueError("x values are constant or nearly constant")

    # Kernel functions
    kernels = {
        'epanechnikov': lambda u: 0.75 * (1 - u**2) * (np.abs(u) < 1),
        'gaussian': lambda u: np.exp(-u**2 / 2) / np.sqrt(2 * np.pi),
        'quartic': lambda u: (15/16) * (1 - u**2)**2 * (np.abs(u) < 1),
    }
    K = kernels.get(kernel, kernels['epanechnikov'])

    # Compute all alternatives
    alternatives = {}

    # Rule of thumb
    sigma = min(np.std(x),
                (np.percentile(x, 75) - np.percentile(x, 25)) / 1.349)
    if sigma < 1e-10:
        sigma = np.std(x) + 1e-10
    rot_constants = {'epanechnikov': 2.34, 'gaussian': 1.06, 'quartic': 2.78}
    h_rot = rot_constants.get(kernel, 2.34) * sigma * n**(-0.2)
    alternatives['rot'] = h_rot

    # Search range based on the rule of thumb
    h_min, h_max = 0.1 * h_rot, 3.0 * h_rot
    h_candidates = np.linspace(h_min, h_max, n_candidates)

    # Leave-one-out cross-validation
    def compute_cv_score(h):
        cv_error = 0.0
        for j in range(n):
            u = (x - x[j]) / h
            w = K(u)
            w[j] = 0  # leave one out
            if np.sum(w) > 1e-10:
                if degree == 0:
                    y_pred = np.sum(w * y) / np.sum(w)
                else:
                    X_loc = np.column_stack(
                        [(x - x[j])**p for p in range(degree + 1)])
                    W = np.diag(w)
                    try:
                        XtWX = X_loc.T @ W @ X_loc
                        XtWy = X_loc.T @ W @ y
                        beta = np.linalg.solve(
                            XtWX + 1e-8 * np.eye(degree + 1), XtWy)
                        y_pred = beta[0]
                    except np.linalg.LinAlgError:
                        y_pred = np.sum(w * y) / np.sum(w)
            else:
                y_pred = np.mean(y)
            cv_error += (y[j] - y_pred)**2
        return cv_error / n

    cv_scores = np.array([compute_cv_score(h) for h in h_candidates])
    h_cv = h_candidates[np.argmin(cv_scores)]
    alternatives['cv'] = h_cv

    # Plugin (simplified polynomial pilot)
    x_std = (x - x.min()) / x_range
    coeffs = np.polyfit(x_std, y, min(4, n // 10))
    poly = np.poly1d(coeffs)
    poly_d2 = poly.deriv(2)
    try:
        roughness, _ = quad(lambda t: poly_d2(t)**2, 0, 1, limit=100)
        roughness = max(roughness, 1e-10)
        sigma2 = np.var(y - poly(x_std), ddof=len(coeffs))
        # Epanechnikov constants; formula on standardized domain, scaled back
        h_plugin = (sigma2 * 0.6 / (0.04 * roughness))**0.2 * n**(-0.2) * x_range
        alternatives['plugin'] = np.clip(h_plugin, h_min, h_max)
    except Exception:
        alternatives['plugin'] = h_rot

    # Select based on method
    if method == 'auto':
        # CV for moderate n; the faster rule of thumb for very large n
        method = 'cv' if n < 1000 else 'rot'

    h_optimal = alternatives.get(method, h_cv)

    return BandwidthResult(
        h_optimal=h_optimal,
        method=method,
        cv_score=cv_scores.min() if method == 'cv' else None,
        edf=None,  # would require the smoother matrix
        h_alternatives=alternatives
    )


# =============================================================================
# Example usage
# =============================================================================
np.random.seed(42)

n = 200
x = np.sort(np.random.uniform(0, 10, n))
y = np.sin(x) + 0.3 * np.cos(3*x) + np.random.normal(0, 0.3, n)

result = select_bandwidth(x, y, method='auto', degree=1)

print("Bandwidth Selection Results")
print("=" * 50)
print(f"Optimal bandwidth: h = {result.h_optimal:.4f}")
print(f"Selection method:  {result.method}")
if result.cv_score is not None:
    print(f"CV score: {result.cv_score:.6f}")
print("Alternative estimates:")
for method, h in result.h_alternatives.items():
    print(f"  {method}: {h:.4f}")
```

We've developed a comprehensive toolkit for bandwidth selection: cross-validation, GCV, plug-in, and rule-of-thumb methods, each with its own accuracy-cost tradeoff.
What's Next:
We've mastered one-dimensional nonparametric regression. But what happens in higher dimensions? The next page confronts the curse of dimensionality—the fundamental challenge that makes nonparametric methods increasingly difficult as the number of predictors grows.
You now have practical tools for bandwidth selection: cross-validation for accuracy, GCV for efficiency, plug-in for speed, and rule-of-thumb for quick exploration. Apply these judiciously based on your data size and computational constraints. Next, we confront the curse of dimensionality.