On the previous page, we explored LOESS—a method that fits local polynomials using weighted least squares. Now we examine an even more fundamental approach: kernel smoothing, which estimates the regression function directly as a weighted average of observed responses.

Kernel smoothing represents the purest expression of the local averaging principle. Rather than fitting a local model and extracting a prediction, we simply ask: what is the weighted average of $y$ values near this point? The weights are determined by a kernel function that decays smoothly with distance.

This simplicity comes with deep theoretical underpinnings. Kernel smoothing connects to probability density estimation, spectral analysis, and the foundations of machine learning. Understanding kernel smoothers illuminates not just regression, but a broader framework for learning from data.
By the end of this page, you will understand: (1) The Nadaraya-Watson kernel estimator; (2) The mathematical theory of kernel functions; (3) Bias, variance, and optimal kernel choice; (4) Local polynomial extensions and their relationship to pure kernel smoothing; (5) Practical implementation considerations; (6) When to use kernel smoothing vs. LOESS.
The Fundamental Estimator:

The Nadaraya-Watson (NW) estimator, introduced independently by Nadaraya (1964) and Watson (1964), is defined as:

$$\hat{f}(x_0) = \frac{\sum_{i=1}^{n} K\left( \frac{x_i - x_0}{h} \right) y_i}{\sum_{i=1}^{n} K\left( \frac{x_i - x_0}{h} \right)}$$

where:
- $K(\cdot)$ is a kernel function (non-negative, integrates to 1)
- $h > 0$ is the bandwidth (controls the degree of smoothing)
- The denominator normalizes the weights to sum to 1

Intuition: We estimate $f(x_0)$ by averaging the $y_i$ values of nearby points, where 'nearby' is determined by the kernel $K$ and bandwidth $h$.
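To make the formula concrete, here is a minimal sketch of the estimator at a single point. The Gaussian kernel, the toy arrays, and the helper name `nw_at_point` are illustrative choices, not part of any library.

```python
import numpy as np

# Toy data (illustrative values only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1])

def nw_at_point(x0: float, h: float) -> float:
    """Nadaraya-Watson estimate at x0 with a Gaussian kernel."""
    u = (x - x0) / h
    w = np.exp(-u**2 / 2) / np.sqrt(2 * np.pi)  # kernel weights K((x_i - x0)/h)
    return np.sum(w * y) / np.sum(w)            # weighted average of nearby y

print(nw_at_point(x0=2.5, h=1.0))  # pulled toward the y values near x = 2.5
```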
Derivation from Conditional Expectation:

The NW estimator emerges naturally from estimating the conditional expectation $E[Y | X = x_0]$.

Recall that:
$$E[Y | X = x] = \frac{\int y \cdot p(x, y) \, dy}{p(x)} = \frac{m(x)}{p(x)}$$

where $m(x) = \int y \cdot p(x, y) \, dy$ and $p(x)$ is the marginal density of $X$.

Using kernel density estimation for both:
$$\hat{p}(x_0) = \frac{1}{nh} \sum_{i=1}^{n} K\left( \frac{x_i - x_0}{h} \right)$$

$$\hat{m}(x_0) = \frac{1}{nh} \sum_{i=1}^{n} K\left( \frac{x_i - x_0}{h} \right) y_i$$

Taking the ratio gives the NW estimator:
$$\hat{f}(x_0) = \frac{\hat{m}(x_0)}{\hat{p}(x_0)} = \frac{\sum_i K_h(x_i - x_0) y_i}{\sum_i K_h(x_i - x_0)}$$

where $K_h(u) = K(u/h)/h$ is the scaled kernel.
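The density-ratio derivation can be checked numerically. The sketch below (synthetic data and a Gaussian kernel, both assumed purely for illustration) computes $\hat{p}(x_0)$ and $\hat{m}(x_0)$ and confirms that their ratio equals the direct weighted-average form, since the $1/(nh)$ factors cancel.

```python
import numpy as np

rng = np.random.default_rng(0)
n, h = 500, 0.4
x = rng.normal(0, 1, n)
y = np.sin(2 * x) + rng.normal(0, 0.3, n)

def K(u):
    return np.exp(-u**2 / 2) / np.sqrt(2 * np.pi)  # Gaussian kernel

x0 = 0.5
w = K((x - x0) / h)

p_hat = np.mean(w) / h       # kernel density estimate p_hat(x0)
m_hat = np.mean(w * y) / h   # m_hat(x0) = (1/nh) sum K((x_i - x0)/h) y_i

ratio = m_hat / p_hat                # NW via the density-ratio derivation
direct = np.sum(w * y) / np.sum(w)   # NW as a direct weighted average

print(ratio, direct)  # identical up to floating-point error
```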
The Nadaraya-Watson estimator is equivalent to local constant (degree 0) LOESS! Both compute a weighted average of nearby $y$ values. LOESS with degree 1 or higher goes beyond NW by fitting local polynomials rather than local constants.
Requirements for a Valid Kernel:

A function $K: \mathbb{R} \to \mathbb{R}_{\geq 0}$ is a valid kernel for smoothing if:

1. Non-negativity: $K(u) \geq 0$ for all $u$
2. Normalization: $\int_{-\infty}^{\infty} K(u) \, du = 1$
3. Symmetry: $K(u) = K(-u)$ (typically assumed, but not required)
4. Centered: $\int u \cdot K(u) \, du = 0$ (implied by symmetry)

Important kernel moments:
$$\mu_j(K) = \int u^j K(u) \, du \quad \text{(j-th moment)}$$
$$R(K) = \int K(u)^2 \, du \quad \text{(roughness)}$$

For symmetric kernels: $\mu_0 = 1$, $\mu_1 = 0$. The second moment $\mu_2$ and roughness $R(K)$ determine the bias and variance of the estimator.
| Kernel | Formula K(u) | Support | μ₂ | R(K) | Efficiency |
|---|---|---|---|---|---|
| Gaussian | $(2\pi)^{-1/2} e^{-u^2/2}$ | $(-\infty, \infty)$ | 1 | $(4\pi)^{-1/2}$ | 95.1% |
| Epanechnikov | $\frac{3}{4}(1-u^2)$ for $|u|<1$ | $[-1, 1]$ | 1/5 | 3/5 | 100% |
| Quartic | $\frac{15}{16}(1-u^2)^2$ for $|u|<1$ | $[-1, 1]$ | 1/7 | 5/7 | 99.4% |
| Tri-cube | $\frac{70}{81}(1-|u|^3)^3$ for $|u|<1$ | $[-1, 1]$ | 35/243 (≈0.144) | 175/247 (≈0.709) | 99.8% |
| Uniform | $\frac{1}{2}$ for $|u|<1$ | $[-1, 1]$ | 1/3 | 1/2 | 92.9% |
| Triangular | $(1-|u|)$ for $|u|<1$ | $[-1, 1]$ | 1/6 | 2/3 | 98.6% |
The Epanechnikov Kernel is Optimal:

Among all non-negative symmetric kernels, the Epanechnikov (parabolic) kernel minimizes the asymptotic mean integrated squared error (MISE). This is proven by calculus of variations:

$$K^*(u) = \frac{3}{4}(1 - u^2) \cdot \mathbb{1}_{|u| \leq 1}$$

However, the efficiency differences are small (at most about 7%), so kernel choice matters far less than bandwidth choice. In practice, choose based on convenience or desired smoothness, not optimality.
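As a quick sanity check of the table values, a short symbolic computation (a sketch that assumes `sympy` is available) recovers $\int K = 1$, $\mu_2 = 1/5$, and $R(K) = 3/5$ for the Epanechnikov kernel.

```python
import sympy as sp

u = sp.symbols('u')
K = sp.Rational(3, 4) * (1 - u**2)   # Epanechnikov kernel on [-1, 1]

integral = sp.integrate(K, (u, -1, 1))       # normalization: should be 1
mu2 = sp.integrate(u**2 * K, (u, -1, 1))     # second moment mu_2
roughness = sp.integrate(K**2, (u, -1, 1))   # roughness R(K)

print(integral, mu2, roughness)  # 1, 1/5, 3/5
```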
```python
import numpy as np
from typing import Callable

def gaussian_kernel(u: np.ndarray) -> np.ndarray:
    """Gaussian kernel: (2π)^(-1/2) * exp(-u²/2)"""
    return np.exp(-u**2 / 2) / np.sqrt(2 * np.pi)

def epanechnikov_kernel(u: np.ndarray) -> np.ndarray:
    """Epanechnikov kernel: 3/4 * (1-u²) for |u| < 1, else 0"""
    u = np.asarray(u)
    k = np.zeros_like(u, dtype=float)
    mask = np.abs(u) < 1
    k[mask] = 0.75 * (1 - u[mask]**2)
    return k

def quartic_kernel(u: np.ndarray) -> np.ndarray:
    """Quartic (biweight) kernel: 15/16 * (1-u²)² for |u| < 1"""
    u = np.asarray(u)
    k = np.zeros_like(u, dtype=float)
    mask = np.abs(u) < 1
    k[mask] = (15/16) * (1 - u[mask]**2)**2
    return k

def triweight_kernel(u: np.ndarray) -> np.ndarray:
    """Triweight kernel: 35/32 * (1-u²)³ for |u| < 1"""
    u = np.asarray(u)
    k = np.zeros_like(u, dtype=float)
    mask = np.abs(u) < 1
    k[mask] = (35/32) * (1 - u[mask]**2)**3
    return k

def uniform_kernel(u: np.ndarray) -> np.ndarray:
    """Uniform (box) kernel: 1/2 for |u| < 1"""
    return 0.5 * (np.abs(u) < 1).astype(float)

def triangular_kernel(u: np.ndarray) -> np.ndarray:
    """Triangular kernel: (1-|u|) for |u| < 1"""
    u = np.asarray(u)
    k = np.zeros_like(u, dtype=float)
    mask = np.abs(u) < 1
    k[mask] = 1 - np.abs(u[mask])
    return k

def compute_kernel_properties(K: Callable, name: str, support: tuple = (-5, 5)) -> dict:
    """
    Compute key properties of a kernel function numerically.
    """
    u = np.linspace(support[0], support[1], 10000)
    du = u[1] - u[0]
    k = K(u)

    # Integration (should be ~1)
    integral = np.sum(k) * du

    # Second moment μ₂
    mu2 = np.sum(u**2 * k) * du

    # Roughness R(K) = ∫K(u)² du
    roughness = np.sum(k**2) * du

    return {
        'name': name,
        'integral': integral,
        'mu2': mu2,
        'roughness': roughness
    }

# Compute and display properties
kernels = [
    (gaussian_kernel, "Gaussian"),
    (epanechnikov_kernel, "Epanechnikov"),
    (quartic_kernel, "Quartic"),
    (uniform_kernel, "Uniform"),
    (triangular_kernel, "Triangular"),
]

print("Kernel Properties (Numerical)")
print("=" * 60)
print(f"{'Kernel':<15} | {'∫K(u)du':<12} | {'μ₂':<12} | {'R(K)':<12}")
print("-" * 60)

for K, name in kernels:
    props = compute_kernel_properties(K, name)
    print(f"{props['name']:<15} | {props['integral']:<12.4f} | "
          f"{props['mu2']:<12.4f} | {props['roughness']:<12.4f}")
```

Compact support kernels (Epanechnikov, uniform, etc.) are exactly zero beyond a finite range, making computations local. Infinite support kernels (Gaussian) assign non-zero weight to all points but decay rapidly. For fixed bandwidth, compact kernels are computationally faster. For nearest-neighbor bandwidth (like LOESS), this distinction matters less.
Asymptotic Bias and Variance of NW Estimator:

Under standard regularity conditions ($f$ twice differentiable, $X$ has density $p(x)$ bounded away from zero), the Nadaraya-Watson estimator has:

Bias:
$$\text{Bias}[\hat{f}_{NW}(x_0)] \approx \frac{h^2 \mu_2(K)}{2} \left[ f''(x_0) + \frac{2 f'(x_0) p'(x_0)}{p(x_0)} \right]$$

Variance:
$$\text{Var}[\hat{f}_{NW}(x_0)] \approx \frac{\sigma^2(x_0) R(K)}{n h p(x_0)}$$

where $\sigma^2(x_0) = \text{Var}[Y | X = x_0]$ is the conditional variance.
Key Observations:

1. Bias scales as $h^2$: The smoothing bias grows quadratically with bandwidth.

2. Design bias term: The factor $\frac{2 f'(x_0) p'(x_0)}{p(x_0)}$ is called design bias. It captures bias arising from a non-uniform distribution of $X$: if the data is denser on one side of $x_0$, the weighted average is pulled toward that side.

3. Variance scales as $1/(nh)$: Larger bandwidth → more data effectively used → lower variance.

4. Bandwidth tradeoff: Small $h$ gives low bias but high variance; large $h$ gives low variance but high bias.

Mean Squared Error:
$$\text{MSE}[\hat{f}(x_0)] \approx \frac{h^4 \mu_2^2 B^2}{4} + \frac{\sigma^2 R(K)}{n h p(x_0)}$$

where $B = f''(x_0) + 2f'(x_0)p'(x_0)/p(x_0)$.
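The tradeoff in the MSE formula can be made visible with a small sketch. All the constants below ($B$, $\sigma^2$, $p(x_0)$, $n$, and the Epanechnikov kernel moments) are assumed values chosen only to illustrate the U-shape of MSE in $h$; they are not estimated from data.

```python
import numpy as np

# Illustrative constants (assumed, not estimated)
B = 2.0        # f''(x0) + 2 f'(x0) p'(x0) / p(x0)
sigma2 = 0.25  # conditional variance at x0
p_x0 = 0.5     # design density at x0
n = 500
mu2, RK = 1/5, 3/5   # Epanechnikov kernel moments

h = np.linspace(0.05, 1.0, 20)
bias_sq = (h**4 * mu2**2 * B**2) / 4        # squared-bias term
variance = sigma2 * RK / (n * h * p_x0)     # variance term
mse = bias_sq + variance

h_best = h[np.argmin(mse)]
print(f"MSE is U-shaped in h; minimum near h = {h_best:.2f}")
```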
The design bias term $\frac{2f'(x_0)p'(x_0)}{p(x_0)}$ means NW performs poorly when the regression function has non-zero slope in regions where the data density is changing. This is a fundamental limitation that local linear regression eliminates. This is why degree-1 LOESS is preferred over kernel smoothing with degree 0.
Optimal Bandwidth for MISE:

Minimizing the Mean Integrated Squared Error (MISE = $\int \text{MSE}[\hat{f}(x)] \, dx$) yields:

$$h_{\text{opt}} = \left( \frac{\sigma^2 R(K)}{\mu_2^2 \int B(x)^2 p(x) \, dx} \right)^{1/5} n^{-1/5}$$

The famous $n^{-1/5}$ rate:
- Optimal bandwidth shrinks as $n^{-1/5}$
- MISE shrinks as $n^{-4/5}$
- Slower than the parametric $n^{-1}$ rate—the 'price' of nonparametric flexibility

This confirms that nonparametric regression requires substantially more data than correctly-specified parametric models.
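A few lines of arithmetic (a sketch, using only the exponents above) show how slowly the optimal bandwidth and the MISE shrink with sample size, and why halving the error costs roughly $2^{5/4} \approx 2.4\times$ more data.

```python
# Rate arithmetic for the nonparametric n^(-1/5) / n^(-4/5) scaling
for n in [100, 1_000, 10_000, 100_000]:
    h_scale = n ** (-1 / 5)      # optimal bandwidth shrinks like n^(-1/5)
    mise_scale = n ** (-4 / 5)   # MISE shrinks like n^(-4/5)
    print(f"n = {n:>7,}   h ∝ {h_scale:.3f}   MISE ∝ {mise_scale:.5f}")

# To halve the MISE you need 2^(5/4) ≈ 2.38 times more data,
# versus 2 times for a correctly specified parametric model.
print(2 ** (5 / 4))
```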
Generalizing Beyond Local Constants:

The Nadaraya-Watson estimator fits a local constant. We can reduce bias by fitting higher-degree local polynomials:

$$\min_{\beta_0, \beta_1, \ldots, \beta_p} \sum_{i=1}^{n} K_h(x_i - x_0) \left[ y_i - \sum_{j=0}^{p} \beta_j (x_i - x_0)^j \right]^2$$

The prediction at $x_0$ is $\hat{f}(x_0) = \hat{\beta}_0$.

This is exactly LOESS with fixed bandwidth $h$!

The choice of kernel matters little for local polynomial regression—the polynomial fitting dominates the behavior. What matters is the polynomial degree and bandwidth.
Bias of Local Polynomial Estimators:

For a local polynomial of degree $p$, the leading bias term is:

$$\text{Bias}[\hat{f}(x_0)] = O(h^{p+1}) \quad \text{if } p \text{ is odd}$$
$$\text{Bias}[\hat{f}(x_0)] = O(h^{p+2}) \quad \text{if } p \text{ is even}$$

Why odd degrees are preferred:
- Local constant ($p=0$): Bias $\sim h^2$ (includes design bias)
- Local linear ($p=1$): Bias $\sim h^2$ but no design bias!
- Local quadratic ($p=2$): Bias $\sim h^4$
- Local cubic ($p=3$): Bias $\sim h^4$ with better constants

Odd degrees have better bias constants and automatically correct for design effects. Local linear is the sweet spot for most applications.
```python
import numpy as np
from typing import Tuple, Callable

def local_polynomial_regression(
    x: np.ndarray,
    y: np.ndarray,
    x_eval: np.ndarray,
    h: float,
    kernel: Callable = None,
    degree: int = 1
) -> np.ndarray:
    """
    Local polynomial kernel regression.

    Parameters:
        x: Training inputs (n,)
        y: Training targets (n,)
        x_eval: Evaluation points (m,)
        h: Bandwidth
        kernel: Kernel function K(u). Defaults to Epanechnikov.
        degree: Local polynomial degree (0, 1, 2, ...)

    Returns:
        y_hat: Predictions at evaluation points (m,)
    """
    if kernel is None:
        kernel = lambda u: 0.75 * (1 - u**2) * (np.abs(u) < 1)

    x = np.asarray(x).flatten()
    y = np.asarray(y).flatten()
    x_eval = np.asarray(x_eval).flatten()

    n = len(x)
    m = len(x_eval)
    y_hat = np.zeros(m)

    for j, x0 in enumerate(x_eval):
        # Compute kernel weights
        u = (x - x0) / h
        w = kernel(u)

        # Build local design matrix
        X_local = np.column_stack([(x - x0)**p for p in range(degree + 1)])

        # Weighted least squares
        W = np.diag(w)
        try:
            XtWX = X_local.T @ W @ X_local
            XtWy = X_local.T @ W @ y

            # Regularize if necessary
            if np.linalg.cond(XtWX) > 1e10:
                XtWX += 1e-8 * np.eye(degree + 1)

            beta = np.linalg.solve(XtWX, XtWy)
            y_hat[j] = beta[0]
        except np.linalg.LinAlgError:
            # Fallback to weighted average
            if np.sum(w) > 0:
                y_hat[j] = np.sum(w * y) / np.sum(w)
            else:
                y_hat[j] = np.mean(y)

    return y_hat

def nadaraya_watson(
    x: np.ndarray,
    y: np.ndarray,
    x_eval: np.ndarray,
    h: float,
    kernel: Callable = None
) -> np.ndarray:
    """
    Nadaraya-Watson estimator (local constant kernel regression).

    This is equivalent to local_polynomial_regression with degree=0,
    but implemented more efficiently.
    """
    if kernel is None:
        kernel = lambda u: 0.75 * (1 - u**2) * (np.abs(u) < 1)

    x = np.asarray(x).flatten()
    y = np.asarray(y).flatten()
    x_eval = np.asarray(x_eval).flatten()

    y_hat = np.zeros(len(x_eval))
    for j, x0 in enumerate(x_eval):
        u = (x - x0) / h
        w = kernel(u)
        w_sum = np.sum(w)
        if w_sum > 0:
            y_hat[j] = np.sum(w * y) / w_sum
        else:
            y_hat[j] = np.mean(y)  # Fallback

    return y_hat

# =============================================================================
# Compare degrees on a curved function with non-uniform design
# =============================================================================
np.random.seed(42)

# Non-uniform design: more points on left
n = 200
x = np.sort(np.random.beta(2, 5, n) * 10)  # Left-skewed
f_true = lambda x: np.sin(x) + 0.3 * x
y = f_true(x) + np.random.normal(0, 0.5, n)

x_eval = np.linspace(0, 10, 100)
h = 0.8

# Fit with different degrees
y_nw = nadaraya_watson(x, y, x_eval, h)                          # degree 0
y_lin = local_polynomial_regression(x, y, x_eval, h, degree=1)   # degree 1
y_quad = local_polynomial_regression(x, y, x_eval, h, degree=2)  # degree 2

# Compute MSE at evaluation points
y_true_eval = f_true(x_eval)
mse_nw = np.mean((y_nw - y_true_eval)**2)
mse_lin = np.mean((y_lin - y_true_eval)**2)
mse_quad = np.mean((y_quad - y_true_eval)**2)

print("Local Polynomial Comparison (non-uniform design)")
print("=" * 50)
print(f"{'Degree':<12} | {'MSE':<12} | {'Notes'}")
print("-" * 50)
print(f"{'0 (NW)':<12} | {mse_nw:<12.6f} | Design bias visible")
print(f"{'1 (Linear)':<12} | {mse_lin:<12.6f} | Best tradeoff")
print(f"{'2 (Quadratic)':<12} | {mse_quad:<12.6f} | Higher variance")
```

Efficient Implementation Strategies:
1. Vectorized Computation:

For fixed bandwidth, we can compute all predictions efficiently using matrix operations:

$$\hat{\mathbf{y}} = \mathbf{K} \mathbf{y} \oslash \mathbf{K} \mathbf{1}$$

where $\mathbf{K}_{ij} = K\left(\frac{x^{\text{eval}}_i - x_j}{h}\right)$ is the kernel matrix, and $\oslash$ denotes element-wise division.

Complexity:
- Build kernel matrix: $O(nm)$ for $n$ training points and $m$ evaluation points
- Matrix-vector products: $O(nm)$
- Total: $O(nm)$ — quadratic in $n$ if $m = n$
2. Exploiting Compact Kernels:

For kernels with finite support (e.g., Epanechnikov), only points within distance $h$ of each evaluation point contribute. Use spatial data structures:

- Sorted arrays + binary search: Find points in $[x_0 - h, x_0 + h]$ in $O(\log n)$
- KD-trees: For multivariate $x$, efficiently find neighbors in $O(\log n)$ average
- Ball trees: Better for high dimensions than KD-trees

3. Binning and Interpolation:

For very large $n$:
1. Bin data into a grid of $B$ bins
2. Compute weighted bin averages
3. Apply kernel smoothing to the $B$ bin values
4. Interpolate to evaluation points

Complexity: Reduces $O(nm)$ to $O(n + Bm)$. Works well when $B \ll n$. (A sketch of the binning strategy appears after the benchmark code below.)
```python
import numpy as np
from typing import Optional

def nadaraya_watson_vectorized(
    x: np.ndarray,
    y: np.ndarray,
    x_eval: np.ndarray,
    h: float,
    kernel_type: str = 'epanechnikov'
) -> np.ndarray:
    """
    Vectorized Nadaraya-Watson for efficient computation.

    Uses numpy broadcasting to compute all weights simultaneously.
    Memory: O(nm) for n training, m evaluation points.
    """
    x = np.asarray(x).flatten()[:, np.newaxis]             # (n, 1)
    y = np.asarray(y).flatten()
    x_eval = np.asarray(x_eval).flatten()[np.newaxis, :]   # (1, m)

    # Compute all pairwise scaled distances: (n, m)
    u = (x - x_eval) / h

    # Apply kernel
    if kernel_type == 'gaussian':
        K = np.exp(-u**2 / 2) / np.sqrt(2 * np.pi)
    elif kernel_type == 'epanechnikov':
        K = 0.75 * (1 - u**2) * (np.abs(u) < 1)
    elif kernel_type == 'uniform':
        K = 0.5 * (np.abs(u) < 1)
    else:
        raise ValueError(f"Unknown kernel: {kernel_type}")

    # Nadaraya-Watson: sum(K * y) / sum(K)
    numerator = K.T @ y          # (m,)
    denominator = K.sum(axis=0)  # (m,)

    # Handle division by zero
    denominator = np.maximum(denominator, 1e-10)

    return numerator / denominator

def nadaraya_watson_sparse(
    x: np.ndarray,
    y: np.ndarray,
    x_eval: np.ndarray,
    h: float
) -> np.ndarray:
    """
    Nadaraya-Watson using sorted arrays for efficiency with compact kernels.

    Only examines points within ±h of each evaluation point.
    Much faster when h is small relative to the data range.
    """
    x = np.asarray(x).flatten()
    y = np.asarray(y).flatten()
    x_eval = np.asarray(x_eval).flatten()

    # Sort training data for binary search
    order = np.argsort(x)
    x_sorted = x[order]
    y_sorted = y[order]
    n = len(x)

    def epanechnikov(u):
        return 0.75 * (1 - u**2) * (np.abs(u) < 1)

    y_hat = np.zeros(len(x_eval))

    for j, x0 in enumerate(x_eval):
        # Binary search for [x0 - h, x0 + h]
        left = np.searchsorted(x_sorted, x0 - h)
        right = np.searchsorted(x_sorted, x0 + h)

        if left >= right:
            # No points in window
            y_hat[j] = np.mean(y)
            continue

        # Only examine points in window
        x_window = x_sorted[left:right]
        y_window = y_sorted[left:right]

        u = (x_window - x0) / h
        w = epanechnikov(u)
        w_sum = np.sum(w)

        if w_sum > 0:
            y_hat[j] = np.sum(w * y_window) / w_sum
        else:
            y_hat[j] = np.mean(y_window)

    return y_hat

# =============================================================================
# Benchmark
# =============================================================================
np.random.seed(42)
n = 10000
m = 1000
x = np.sort(np.random.uniform(0, 100, n))
y = np.sin(x / 10) + np.random.normal(0, 0.3, n)
x_eval = np.linspace(0, 100, m)
h = 2.0

import time

# Vectorized
t0 = time.time()
y_vec = nadaraya_watson_vectorized(x, y, x_eval, h)
t_vec = time.time() - t0

# Sparse (exploits compact kernel)
t0 = time.time()
y_sparse = nadaraya_watson_sparse(x, y, x_eval, h)
t_sparse = time.time() - t0

print("Performance Comparison")
print("=" * 50)
print(f"n = {n:,}, m = {m:,}, h = {h}")
print(f"Vectorized: {t_vec:.4f}s")
print(f"Sparse:     {t_sparse:.4f}s")
print(f"Speedup:    {t_vec / t_sparse:.1f}x")
print(f"Max diff:   {np.max(np.abs(y_vec - y_sparse)):.6f}")
```

In practice, use well-tested libraries: statsmodels.nonparametric.lowess (Python), scipy.ndimage.gaussian_filter1d (for Gaussian kernels), or sklearn.neighbors.KNeighborsRegressor (k-NN regression). These handle edge cases and optimize for performance.
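Strategy 3 (binning and interpolation) is not covered by the listing above, so here is a minimal, self-contained sketch of the bin-then-smooth idea. The uniform grid, the Epanechnikov kernel, the bin count, and the helper name `nadaraya_watson_binned` are all illustrative assumptions, and the result is only an approximation to the exact fit.

```python
import numpy as np

def nadaraya_watson_binned(x, y, x_eval, h, n_bins=400):
    """Approximate NW regression: bin the data, kernel-smooth the bin
    counts and bin sums, then interpolate to x_eval. Sketch only."""
    x, y, x_eval = map(np.asarray, (x, y, x_eval))

    # 1. Bin the data onto a uniform grid of bin centers
    edges = np.linspace(x.min(), x.max(), n_bins + 1)
    centers = 0.5 * (edges[:-1] + edges[1:])
    idx = np.clip(np.digitize(x, edges) - 1, 0, n_bins - 1)
    counts = np.bincount(idx, minlength=n_bins)            # points per bin
    sums = np.bincount(idx, weights=y, minlength=n_bins)   # sum of y per bin

    # 2. Kernel-smooth the binned sums and counts (Epanechnikov kernel)
    u = (centers[:, None] - centers[None, :]) / h          # (B, B), B << n
    K = 0.75 * (1 - u**2) * (np.abs(u) < 1)
    fit_at_centers = (K @ sums) / np.maximum(K @ counts, 1e-10)

    # 3. Interpolate from the grid to the evaluation points
    return np.interp(x_eval, centers, fit_at_centers)

# Compare against an exact (broadcasted) NW fit on the same synthetic data
rng = np.random.default_rng(42)
x = np.sort(rng.uniform(0, 100, 10_000))
y = np.sin(x / 10) + rng.normal(0, 0.3, 10_000)
x_eval = np.linspace(0, 100, 1_000)
h = 2.0

u = (x[:, None] - x_eval[None, :]) / h
K = 0.75 * (1 - u**2) * (np.abs(u) < 1)
exact = (K.T @ y) / np.maximum(K.sum(axis=0), 1e-10)

approx = nadaraya_watson_binned(x, y, x_eval, h)
print("Max |binned - exact|:", np.max(np.abs(approx - exact)))
```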
Conceptual Relationship:

Kernel smoothing (Nadaraya-Watson) and LOESS are closely related:
- NW estimator = LOESS with degree 0 and fixed bandwidth
- Local linear = LOESS with degree 1
- LOESS typically uses adaptive (nearest-neighbor) bandwidth
- Kernel smoothing typically uses fixed bandwidth

The key practical differences are:

1. Bandwidth adaptation: LOESS adapts to local data density; kernel smoothing uses fixed $h$

2. Boundary behavior: Local linear (LOESS default) corrects boundary bias; NW does not

3. Computation: NW is simpler and faster; local polynomial adds matrix inversions

4. Robustness: LOESS has built-in robustifying iterations; standard kernel smoothing does not
| Aspect | Kernel Smoothing (NW) | LOESS (Local Linear) |
|---|---|---|
| Local model | Constant (weighted average) | Linear (or polynomial) |
| Bandwidth | Fixed $h$ | Adaptive (fraction $\alpha$ of data) |
| Boundary bias | Yes (systematic) | No (automatically corrected) |
| Design bias | Yes (problematic) | No (corrected) |
| Computation | $O(nm)$ simple | $O(nm p^3)$ matrix ops |
| Theory | Cleaner asymptotics | More complex |
| Derivatives | Not directly available | Local slope available |
| Outlier handling | None built-in | Robustifying iterations |
| Common use | Theory, quick exploration | Publication-quality smoothing |
For most practical applications, use LOESS with local linear fits (degree 1). The boundary and design bias corrections are worth the modest computational overhead. Reserve pure kernel smoothing (NW) for theoretical work, speed-critical applications with uniform designs, or as a building block for more complex methods.
Extension to Multiple Predictors:

For $\mathbf{x} = (x_1, \ldots, x_d) \in \mathbb{R}^d$, the multivariate NW estimator is:

$$\hat{f}(\mathbf{x}_0) = \frac{\sum_{i=1}^{n} K_\mathbf{H}(\mathbf{x}_i - \mathbf{x}_0) y_i}{\sum_{i=1}^{n} K_\mathbf{H}(\mathbf{x}_i - \mathbf{x}_0)}$$

where $\mathbf{H}$ is a $d \times d$ bandwidth matrix and:

$$K_\mathbf{H}(\mathbf{u}) = |\mathbf{H}|^{-1/2} K(\mathbf{H}^{-1/2} \mathbf{u})$$

Product kernel (equal bandwidth in every coordinate):

The simplest choice is a product of univariate kernels with a single bandwidth:
$$K_\mathbf{H}(\mathbf{u}) = \prod_{j=1}^{d} \frac{1}{h} K\left( \frac{u_j}{h} \right)$$

Bandwidth selection in high dimensions:

- Optimal bandwidth: $h_{\text{opt}} = O(n^{-1/(4+d)})$
- MISE: $O(n^{-4/(4+d)})$

As $d$ increases, both the optimal bandwidth and the convergence rate deteriorate dramatically—this is the curse of dimensionality, which we'll explore in detail later.
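A minimal sketch of the product-kernel estimator in $d = 2$ follows. The Gaussian kernel, the single shared bandwidth, the synthetic data, and the helper name `nw_multivariate` are illustrative assumptions.

```python
import numpy as np

def nw_multivariate(X, y, X_eval, h):
    """Multivariate Nadaraya-Watson with a product Gaussian kernel
    and a single bandwidth h for every coordinate (sketch)."""
    X = np.asarray(X, dtype=float)            # (n, d)
    X_eval = np.asarray(X_eval, dtype=float)  # (m, d)

    # Pairwise scaled differences: (m, n, d)
    U = (X_eval[:, None, :] - X[None, :, :]) / h

    # Product kernel: multiply univariate Gaussian kernels over coordinates
    K = np.prod(np.exp(-U**2 / 2) / (np.sqrt(2 * np.pi) * h), axis=2)  # (m, n)

    return (K @ y) / np.maximum(K.sum(axis=1), 1e-10)

# Toy example in d = 2
rng = np.random.default_rng(1)
X = rng.uniform(-2, 2, size=(500, 2))
y = np.sin(X[:, 0]) + X[:, 1]**2 + rng.normal(0, 0.2, 500)

X_eval = np.array([[0.0, 0.0], [1.0, -1.0]])
print(nw_multivariate(X, y, X_eval, h=0.5))
```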
In $d = 10$ dimensions, the MISE rate is only $O(n^{-4/14}) \approx O(n^{-0.29})$. To halve the error, you need $2^{1/0.29} \approx 11$ times more data! This fundamental limitation makes nonparametric regression impractical for high-dimensional problems without additional structure.
We've developed a comprehensive understanding of kernel smoothing as the foundation of nonparametric regression. Let's consolidate the key concepts:
What's Next:

We've seen that bandwidth is the critical hyperparameter. But how do we choose it in practice? The next page develops principled approaches to bandwidth selection: cross-validation, plug-in methods, and rule-of-thumb estimators.
You now understand kernel smoothing from theory to implementation. The Nadaraya-Watson estimator and its local polynomial extensions form the foundation of nonparametric regression. Next, we tackle the practical challenge of selecting the bandwidth parameter.