On the previous page, we explored LOESS—a method that fits local polynomials using weighted least squares. Now we examine an even more fundamental approach: kernel smoothing, which estimates the regression function directly as a weighted average of observed responses.

Kernel smoothing represents the purest expression of the local averaging principle. Rather than fitting a local model and extracting a prediction, we simply ask: what is the weighted average of $y$ values near this point? The weights are determined by a kernel function that decays smoothly with distance.

This simplicity comes with deep theoretical underpinnings. Kernel smoothing connects to probability density estimation, spectral analysis, and the foundations of machine learning. Understanding kernel smoothers illuminates not just regression, but a broader framework for learning from data.
By the end of this page, you will understand: (1) The Nadaraya-Watson kernel estimator; (2) The mathematical theory of kernel functions; (3) Bias, variance, and optimal kernel choice; (4) Local polynomial extensions and their relationship to pure kernel smoothing; (5) Practical implementation considerations; (6) When to use kernel smoothing vs. LOESS.
The Fundamental Estimator:

The Nadaraya-Watson (NW) estimator, introduced independently by Nadaraya (1964) and Watson (1964), is defined as:

$$\hat{f}(x_0) = \frac{\sum_{i=1}^{n} K\left( \frac{x_i - x_0}{h} \right) y_i}{\sum_{i=1}^{n} K\left( \frac{x_i - x_0}{h} \right)}$$

where:
- $K(\cdot)$ is a kernel function (non-negative, integrates to 1)
- $h > 0$ is the bandwidth (controls the degree of smoothing)
- The denominator normalizes the weights to sum to 1

Intuition: We estimate $f(x_0)$ by averaging the $y_i$ values of nearby points, where 'nearby' is determined by the kernel $K$ and bandwidth $h$.
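To make the formula concrete, here is a minimal sketch of the estimator at a single point. The Gaussian kernel, the toy arrays, and the helper name `nw_at_point` are illustrative choices, not part of any library.

```python
import numpy as np

# Toy data (illustrative values only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1])

def nw_at_point(x0: float, h: float) -> float:
    """Nadaraya-Watson estimate at x0 with a Gaussian kernel."""
    u = (x - x0) / h
    w = np.exp(-u**2 / 2) / np.sqrt(2 * np.pi)  # kernel weights K((x_i - x0)/h)
    return np.sum(w * y) / np.sum(w)            # weighted average of nearby y

print(nw_at_point(x0=2.5, h=1.0))  # pulled toward the y values near x = 2.5
```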
Derivation from Conditional Expectation:

The NW estimator emerges naturally from estimating the conditional expectation $E[Y | X = x_0]$.

Recall that:
$$E[Y | X = x] = \frac{\int y \cdot p(x, y) \, dy}{p(x)} = \frac{m(x)}{p(x)}$$

where $m(x) = \int y \cdot p(x, y) \, dy$ and $p(x)$ is the marginal density of $X$.

Using kernel density estimation for both:
$$\hat{p}(x_0) = \frac{1}{nh} \sum_{i=1}^{n} K\left( \frac{x_i - x_0}{h} \right)$$

$$\hat{m}(x_0) = \frac{1}{nh} \sum_{i=1}^{n} K\left( \frac{x_i - x_0}{h} \right) y_i$$

Taking the ratio gives the NW estimator:
$$\hat{f}(x_0) = \frac{\hat{m}(x_0)}{\hat{p}(x_0)} = \frac{\sum_i K_h(x_i - x_0) y_i}{\sum_i K_h(x_i - x_0)}$$

where $K_h(u) = K(u/h)/h$ is the scaled kernel.
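The density-ratio derivation can be checked numerically. The sketch below (synthetic data and a Gaussian kernel, both assumed purely for illustration) computes $\hat{p}(x_0)$ and $\hat{m}(x_0)$ and confirms that their ratio equals the direct weighted-average form, since the $1/(nh)$ factors cancel.

```python
import numpy as np

rng = np.random.default_rng(0)
n, h = 500, 0.4
x = rng.normal(0, 1, n)
y = np.sin(2 * x) + rng.normal(0, 0.3, n)

def K(u):
    return np.exp(-u**2 / 2) / np.sqrt(2 * np.pi)  # Gaussian kernel

x0 = 0.5
w = K((x - x0) / h)

p_hat = np.mean(w) / h       # kernel density estimate p_hat(x0)
m_hat = np.mean(w * y) / h   # m_hat(x0) = (1/nh) sum K((x_i - x0)/h) y_i

ratio = m_hat / p_hat                # NW via the density-ratio derivation
direct = np.sum(w * y) / np.sum(w)   # NW as a direct weighted average

print(ratio, direct)  # identical up to floating-point error
```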
The Nadaraya-Watson estimator is equivalent to local constant (degree 0) LOESS! Both compute a weighted average of nearby $y$ values. LOESS with degree 1 or higher goes beyond NW by fitting local polynomials rather than local constants.
Requirements for a Valid Kernel:

A function $K: \mathbb{R} \to \mathbb{R}_{\geq 0}$ is a valid kernel for smoothing if:

1. Non-negativity: $K(u) \geq 0$ for all $u$
2. Normalization: $\int_{-\infty}^{\infty} K(u) \, du = 1$
3. Symmetry: $K(u) = K(-u)$ (typically assumed, but not required)
4. Centered: $\int u \cdot K(u) \, du = 0$ (implied by symmetry)

Important kernel moments:
$$\mu_j(K) = \int u^j K(u) \, du \quad \text{(j-th moment)}$$
$$R(K) = \int K(u)^2 \, du \quad \text{(roughness)}$$

For symmetric kernels: $\mu_0 = 1$, $\mu_1 = 0$. The second moment $\mu_2$ and roughness $R(K)$ determine the bias and variance of the estimator.
| Kernel | Formula K(u) | Support | μ₂ | R(K) | Efficiency |
|---|---|---|---|---|---|
| Gaussian | $(2\pi)^{-1/2} e^{-u^2/2}$ | $(-\infty, \infty)$ | 1 | $(4\pi)^{-1/2}$ | 95.1% |
| Epanechnikov | $\frac{3}{4}(1-u^2)$ for $|u|<1$ | $[-1, 1]$ | 1/5 | 3/5 | 100% |
| Quartic | $\frac{15}{16}(1-u^2)^2$ for $|u|<1$ | $[-1, 1]$ | 1/7 | 5/7 | 99.4% |
| Tri-cube | $\frac{70}{81}(1-|u|^3)^3$ for $|u|<1$ | $[-1, 1]$ | 35/243 (≈0.144) | 175/247 (≈0.709) | 99.8% |
| Uniform | $\frac{1}{2}$ for $|u|<1$ | $[-1, 1]$ | 1/3 | 1/2 | 92.9% |
| Triangular | $(1-|u|)$ for $|u|<1$ | $[-1, 1]$ | 1/6 | 2/3 | 98.6% |
The Epanechnikov Kernel is Optimal:

Among all non-negative symmetric kernels, the Epanechnikov (parabolic) kernel minimizes the asymptotic mean integrated squared error (MISE). This is proven by calculus of variations:

$$K^*(u) = \frac{3}{4}(1 - u^2) \cdot \mathbb{1}_{|u| \leq 1}$$

However, the efficiency differences are small (at most about 7%), so kernel choice matters far less than bandwidth choice. In practice, choose based on convenience or desired smoothness, not optimality.
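As a quick sanity check of the table values, a short symbolic computation (a sketch that assumes `sympy` is available) recovers $\int K = 1$, $\mu_2 = 1/5$, and $R(K) = 3/5$ for the Epanechnikov kernel.

```python
import sympy as sp

u = sp.symbols('u')
K = sp.Rational(3, 4) * (1 - u**2)   # Epanechnikov kernel on [-1, 1]

integral = sp.integrate(K, (u, -1, 1))       # normalization: should be 1
mu2 = sp.integrate(u**2 * K, (u, -1, 1))     # second moment mu_2
roughness = sp.integrate(K**2, (u, -1, 1))   # roughness R(K)

print(integral, mu2, roughness)  # 1, 1/5, 3/5
```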
```python
import numpy as np
from typing import Callable

def gaussian_kernel(u: np.ndarray) -> np.ndarray:
    """Gaussian kernel: (2π)^(-1/2) * exp(-u²/2)"""
    return np.exp(-u**2 / 2) / np.sqrt(2 * np.pi)

def epanechnikov_kernel(u: np.ndarray) -> np.ndarray:
    """Epanechnikov kernel: 3/4 * (1-u²) for |u| < 1, else 0"""
    u = np.asarray(u)
    k = np.zeros_like(u, dtype=float)
    mask = np.abs(u) < 1
    k[mask] = 0.75 * (1 - u[mask]**2)
    return k

def quartic_kernel(u: np.ndarray) -> np.ndarray:
    """Quartic (biweight) kernel: 15/16 * (1-u²)² for |u| < 1"""
    u = np.asarray(u)
    k = np.zeros_like(u, dtype=float)
    mask = np.abs(u) < 1
    k[mask] = (15/16) * (1 - u[mask]**2)**2
    return k

def triweight_kernel(u: np.ndarray) -> np.ndarray:
    """Triweight kernel: 35/32 * (1-u²)³ for |u| < 1"""
    u = np.asarray(u)
    k = np.zeros_like(u, dtype=float)
    mask = np.abs(u) < 1
    k[mask] = (35/32) * (1 - u[mask]**2)**3
    return k

def uniform_kernel(u: np.ndarray) -> np.ndarray:
    """Uniform (box) kernel: 1/2 for |u| < 1"""
    return 0.5 * (np.abs(u) < 1).astype(float)

def triangular_kernel(u: np.ndarray) -> np.ndarray:
    """Triangular kernel: (1-|u|) for |u| < 1"""
    u = np.asarray(u)
    k = np.zeros_like(u, dtype=float)
    mask = np.abs(u) < 1
    k[mask] = 1 - np.abs(u[mask])
    return k

def compute_kernel_properties(K: Callable, name: str, support: tuple = (-5, 5)) -> dict:
    """
    Compute key properties of a kernel function numerically.
    """
    u = np.linspace(support[0], support[1], 10000)
    du = u[1] - u[0]
    k = K(u)

    # Integration (should be ~1)
    integral = np.sum(k) * du

    # Second moment μ₂
    mu2 = np.sum(u**2 * k) * du

    # Roughness R(K) = ∫K(u)² du
    roughness = np.sum(k**2) * du

    return {
        'name': name,
        'integral': integral,
        'mu2': mu2,
        'roughness': roughness
    }

# Compute and display properties
kernels = [
    (gaussian_kernel, "Gaussian"),
    (epanechnikov_kernel, "Epanechnikov"),
    (quartic_kernel, "Quartic"),
    (uniform_kernel, "Uniform"),
    (triangular_kernel, "Triangular"),
]

print("Kernel Properties (Numerical)")
print("=" * 60)
print(f"{'Kernel':<15} | {'∫K(u)du':<12} | {'μ₂':<12} | {'R(K)':<12}")
print("-" * 60)

for K, name in kernels:
    props = compute_kernel_properties(K, name)
    print(f"{props['name']:<15} | {props['integral']:<12.4f} | "
          f"{props['mu2']:<12.4f} | {props['roughness']:<12.4f}")
```

Compact support kernels (Epanechnikov, uniform, etc.) are exactly zero beyond a finite range, making computations local. Infinite support kernels (Gaussian) assign non-zero weight to all points but decay rapidly. For fixed bandwidth, compact kernels are computationally faster. For nearest-neighbor bandwidth (like LOESS), this distinction matters less.
Asymptotic Bias and Variance of NW Estimator:

Under standard regularity conditions ($f$ twice differentiable, $X$ has density $p(x)$ bounded away from zero), the Nadaraya-Watson estimator has:

Bias:
$$\text{Bias}[\hat{f}_{NW}(x_0)] \approx \frac{h^2 \mu_2(K)}{2} \left[ f''(x_0) + \frac{2 f'(x_0) p'(x_0)}{p(x_0)} \right]$$

Variance:
$$\text{Var}[\hat{f}_{NW}(x_0)] \approx \frac{\sigma^2(x_0) R(K)}{n h p(x_0)}$$

where $\sigma^2(x_0) = \text{Var}[Y | X = x_0]$ is the conditional variance.
Key Observations:

1. Bias scales as $h^2$: The smoothing bias grows quadratically with bandwidth.

2. Design bias term: The factor $\frac{2 f'(x_0) p'(x_0)}{p(x_0)}$ is called design bias. It captures bias arising from a non-uniform distribution of $X$: if the data is denser on one side of $x_0$, the weighted average is pulled toward that side.

3. Variance scales as $1/(nh)$: Larger bandwidth → more data effectively used → lower variance.

4. Bandwidth tradeoff: Small $h$ gives low bias but high variance; large $h$ gives low variance but high bias.

Mean Squared Error:
$$\text{MSE}[\hat{f}(x_0)] \approx \frac{h^4 \mu_2^2 B^2}{4} + \frac{\sigma^2 R(K)}{n h p(x_0)}$$

where $B = f''(x_0) + 2f'(x_0)p'(x_0)/p(x_0)$.
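The tradeoff in the MSE formula can be made visible with a small sketch. All the constants below ($B$, $\sigma^2$, $p(x_0)$, $n$, and the Epanechnikov kernel moments) are assumed values chosen only to illustrate the U-shape of MSE in $h$; they are not estimated from data.

```python
import numpy as np

# Illustrative constants (assumed, not estimated)
B = 2.0        # f''(x0) + 2 f'(x0) p'(x0) / p(x0)
sigma2 = 0.25  # conditional variance at x0
p_x0 = 0.5     # design density at x0
n = 500
mu2, RK = 1/5, 3/5   # Epanechnikov kernel moments

h = np.linspace(0.05, 1.0, 20)
bias_sq = (h**4 * mu2**2 * B**2) / 4        # squared-bias term
variance = sigma2 * RK / (n * h * p_x0)     # variance term
mse = bias_sq + variance

h_best = h[np.argmin(mse)]
print(f"MSE is U-shaped in h; minimum near h = {h_best:.2f}")
```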
The design bias term $\frac{2f'(x_0)p'(x_0)}{p(x_0)}$ means NW performs poorly when the regression function has non-zero slope in regions where the data density is changing. This is a fundamental limitation that local linear regression eliminates. This is why degree-1 LOESS is preferred over kernel smoothing with degree 0.
Optimal Bandwidth for MISE:

Minimizing the Mean Integrated Squared Error (MISE = $\int \text{MSE}[\hat{f}(x)] \, dx$) yields:

$$h_{\text{opt}} = \left( \frac{\sigma^2 R(K)}{\mu_2^2 \int B(x)^2 p(x) \, dx} \right)^{1/5} n^{-1/5}$$

The famous $n^{-1/5}$ rate:
- Optimal bandwidth shrinks as $n^{-1/5}$
- MISE shrinks as $n^{-4/5}$
- Slower than the parametric $n^{-1}$ rate—the 'price' of nonparametric flexibility

This confirms that nonparametric regression requires substantially more data than correctly-specified parametric models.
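A few lines of arithmetic (a sketch, using only the exponents above) show how slowly the optimal bandwidth and the MISE shrink with sample size, and why halving the error costs roughly $2^{5/4} \approx 2.4\times$ more data.

```python
# Rate arithmetic for the nonparametric n^(-1/5) / n^(-4/5) scaling
for n in [100, 1_000, 10_000, 100_000]:
    h_scale = n ** (-1 / 5)      # optimal bandwidth shrinks like n^(-1/5)
    mise_scale = n ** (-4 / 5)   # MISE shrinks like n^(-4/5)
    print(f"n = {n:>7,}   h ∝ {h_scale:.3f}   MISE ∝ {mise_scale:.5f}")

# To halve the MISE you need 2^(5/4) ≈ 2.38 times more data,
# versus 2 times for a correctly specified parametric model.
print(2 ** (5 / 4))
```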
Generalizing Beyond Local Constants:

The Nadaraya-Watson estimator fits a local constant. We can reduce bias by fitting higher-degree local polynomials:

$$\min_{\beta_0, \beta_1, \ldots, \beta_p} \sum_{i=1}^{n} K_h(x_i - x_0) \left[ y_i - \sum_{j=0}^{p} \beta_j (x_i - x_0)^j \right]^2$$

The prediction at $x_0$ is $\hat{f}(x_0) = \hat{\beta}_0$.

This is exactly LOESS with fixed bandwidth $h$!

The choice of kernel matters little for local polynomial regression—the polynomial fitting dominates the behavior. What matters is the polynomial degree and bandwidth.
Bias of Local Polynomial Estimators:

For a local polynomial of degree $p$, the leading bias term is:

$$\text{Bias}[\hat{f}(x_0)] = O(h^{p+1}) \quad \text{if } p \text{ is odd}$$
$$\text{Bias}[\hat{f}(x_0)] = O(h^{p+2}) \quad \text{if } p \text{ is even}$$

Why odd degrees are preferred:
- Local constant ($p=0$): Bias $\sim h^2$ (includes design bias)
- Local linear ($p=1$): Bias $\sim h^2$ but no design bias!
- Local quadratic ($p=2$): Bias $\sim h^4$
- Local cubic ($p=3$): Bias $\sim h^4$ with better constants

Odd degrees have better bias constants and automatically correct for design effects. Local linear is the sweet spot for most applications.
```python
import numpy as np
from typing import Tuple, Callable

def local_polynomial_regression(
    x: np.ndarray,
    y: np.ndarray,
    x_eval: np.ndarray,
    h: float,
    kernel: Callable = None,
    degree: int = 1
) -> np.ndarray:
    """
    Local polynomial kernel regression.

    Parameters:
        x: Training inputs (n,)
        y: Training targets (n,)
        x_eval: Evaluation points (m,)
        h: Bandwidth
        kernel: Kernel function K(u). Defaults to Epanechnikov.
        degree: Local polynomial degree (0, 1, 2, ...)

    Returns:
        y_hat: Predictions at evaluation points (m,)
    """
    if kernel is None:
        kernel = lambda u: 0.75 * (1 - u**2) * (np.abs(u) < 1)

    x = np.asarray(x).flatten()
    y = np.asarray(y).flatten()
    x_eval = np.asarray(x_eval).flatten()

    n = len(x)
    m = len(x_eval)
    y_hat = np.zeros(m)

    for j, x0 in enumerate(x_eval):
        # Compute kernel weights
        u = (x - x0) / h
        w = kernel(u)

        # Build local design matrix
        X_local = np.column_stack([(x - x0)**p for p in range(degree + 1)])

        # Weighted least squares
        W = np.diag(w)
        try:
            XtWX = X_local.T @ W @ X_local
            XtWy = X_local.T @ W @ y

            # Regularize if necessary
            if np.linalg.cond(XtWX) > 1e10:
                XtWX += 1e-8 * np.eye(degree + 1)

            beta = np.linalg.solve(XtWX, XtWy)
            y_hat[j] = beta[0]
        except np.linalg.LinAlgError:
            # Fallback to weighted average
            if np.sum(w) > 0:
                y_hat[j] = np.sum(w * y) / np.sum(w)
            else:
                y_hat[j] = np.mean(y)

    return y_hat

def nadaraya_watson(
    x: np.ndarray,
    y: np.ndarray,
    x_eval: np.ndarray,
    h: float,
    kernel: Callable = None
) -> np.ndarray:
    """
    Nadaraya-Watson estimator (local constant kernel regression).

    This is equivalent to local_polynomial_regression with degree=0,
    but implemented more efficiently.
    """
    if kernel is None:
        kernel = lambda u: 0.75 * (1 - u**2) * (np.abs(u) < 1)

    x = np.asarray(x).flatten()
    y = np.asarray(y).flatten()
    x_eval = np.asarray(x_eval).flatten()

    y_hat = np.zeros(len(x_eval))
    for j, x0 in enumerate(x_eval):
        u = (x - x0) / h
        w = kernel(u)
        w_sum = np.sum(w)
        if w_sum > 0:
            y_hat[j] = np.sum(w * y) / w_sum
        else:
            y_hat[j] = np.mean(y)  # Fallback

    return y_hat

# =============================================================================
# Compare degrees on a curved function with non-uniform design
# =============================================================================
np.random.seed(42)

# Non-uniform design: more points on left
n = 200
x = np.sort(np.random.beta(2, 5, n) * 10)  # Left-skewed
f_true = lambda x: np.sin(x) + 0.3 * x
y = f_true(x) + np.random.normal(0, 0.5, n)

x_eval = np.linspace(0, 10, 100)
h = 0.8

# Fit with different degrees
y_nw = nadaraya_watson(x, y, x_eval, h)                          # degree 0
y_lin = local_polynomial_regression(x, y, x_eval, h, degree=1)   # degree 1
y_quad = local_polynomial_regression(x, y, x_eval, h, degree=2)  # degree 2

# Compute MSE at evaluation points
y_true_eval = f_true(x_eval)
mse_nw = np.mean((y_nw - y_true_eval)**2)
mse_lin = np.mean((y_lin - y_true_eval)**2)
mse_quad = np.mean((y_quad - y_true_eval)**2)

print("Local Polynomial Comparison (non-uniform design)")
print("=" * 50)
print(f"{'Degree':<12} | {'MSE':<12} | {'Notes'}")
print("-" * 50)
print(f"{'0 (NW)':<12} | {mse_nw:<12.6f} | Design bias visible")
print(f"{'1 (Linear)':<12} | {mse_lin:<12.6f} | Best tradeoff")
print(f"{'2 (Quadratic)':<12} | {mse_quad:<12.6f} | Higher variance")
```

Efficient Implementation Strategies:
1. Vectorized Computation:

For fixed bandwidth, we can compute all predictions efficiently using matrix operations:

$$\hat{\mathbf{y}} = \mathbf{K} \mathbf{y} \oslash \mathbf{K} \mathbf{1}$$

where $\mathbf{K}_{ij} = K\left(\frac{x^{\text{eval}}_i - x_j}{h}\right)$ is the kernel matrix, and $\oslash$ denotes element-wise division.

Complexity:
- Build kernel matrix: $O(nm)$ for $n$ training points and $m$ evaluation points
- Matrix-vector products: $O(nm)$
- Total: $O(nm)$ — quadratic in $n$ if $m = n$
2. Exploiting Compact Kernels:

For kernels with finite support (e.g., Epanechnikov), only points within distance $h$ of each evaluation point contribute. Use spatial data structures:

- Sorted arrays + binary search: Find points in $[x_0 - h, x_0 + h]$ in $O(\log n)$
- KD-trees: For multivariate $x$, efficiently find neighbors in $O(\log n)$ average
- Ball trees: Better for high dimensions than KD-trees

3. Binning and Interpolation:

For very large $n$:
1. Bin data into a grid of $B$ bins
2. Compute weighted bin averages
3. Apply kernel smoothing to the $B$ bin values
4. Interpolate to evaluation points

Complexity: Reduces $O(nm)$ to $O(n + Bm)$. Works well when $B \ll n$. (A sketch of the binning strategy appears after the benchmark code below.)
```python
import numpy as np
from typing import Optional

def nadaraya_watson_vectorized(
    x: np.ndarray,
    y: np.ndarray,
    x_eval: np.ndarray,
    h: float,
    kernel_type: str = 'epanechnikov'
) -> np.ndarray:
    """
    Vectorized Nadaraya-Watson for efficient computation.

    Uses numpy broadcasting to compute all weights simultaneously.
    Memory: O(nm) for n training, m evaluation points.
    """
    x = np.asarray(x).flatten()[:, np.newaxis]             # (n, 1)
    y = np.asarray(y).flatten()
    x_eval = np.asarray(x_eval).flatten()[np.newaxis, :]   # (1, m)

    # Compute all pairwise scaled distances: (n, m)
    u = (x - x_eval) / h

    # Apply kernel
    if kernel_type == 'gaussian':
        K = np.exp(-u**2 / 2) / np.sqrt(2 * np.pi)
    elif kernel_type == 'epanechnikov':
        K = 0.75 * (1 - u**2) * (np.abs(u) < 1)
    elif kernel_type == 'uniform':
        K = 0.5 * (np.abs(u) < 1)
    else:
        raise ValueError(f"Unknown kernel: {kernel_type}")

    # Nadaraya-Watson: sum(K * y) / sum(K)
    numerator = K.T @ y          # (m,)
    denominator = K.sum(axis=0)  # (m,)

    # Handle division by zero
    denominator = np.maximum(denominator, 1e-10)

    return numerator / denominator

def nadaraya_watson_sparse(
    x: np.ndarray,
    y: np.ndarray,
    x_eval: np.ndarray,
    h: float
) -> np.ndarray:
    """
    Nadaraya-Watson using sorted arrays for efficiency with compact kernels.

    Only examines points within ±h of each evaluation point.
    Much faster when h is small relative to the data range.
    """
    x = np.asarray(x).flatten()
    y = np.asarray(y).flatten()
    x_eval = np.asarray(x_eval).flatten()

    # Sort training data for binary search
    order = np.argsort(x)
    x_sorted = x[order]
    y_sorted = y[order]
    n = len(x)

    def epanechnikov(u):
        return 0.75 * (1 - u**2) * (np.abs(u) < 1)

    y_hat = np.zeros(len(x_eval))

    for j, x0 in enumerate(x_eval):
        # Binary search for [x0 - h, x0 + h]
        left = np.searchsorted(x_sorted, x0 - h)
        right = np.searchsorted(x_sorted, x0 + h)

        if left >= right:
            # No points in window
            y_hat[j] = np.mean(y)
            continue

        # Only examine points in window
        x_window = x_sorted[left:right]
        y_window = y_sorted[left:right]

        u = (x_window - x0) / h
        w = epanechnikov(u)
        w_sum = np.sum(w)

        if w_sum > 0:
            y_hat[j] = np.sum(w * y_window) / w_sum
        else:
            y_hat[j] = np.mean(y_window)

    return y_hat

# =============================================================================
# Benchmark
# =============================================================================
np.random.seed(42)
n = 10000
m = 1000
x = np.sort(np.random.uniform(0, 100, n))
y = np.sin(x / 10) + np.random.normal(0, 0.3, n)
x_eval = np.linspace(0, 100, m)
h = 2.0

import time

# Vectorized
t0 = time.time()
y_vec = nadaraya_watson_vectorized(x, y, x_eval, h)
t_vec = time.time() - t0

# Sparse (exploits compact kernel)
t0 = time.time()
y_sparse = nadaraya_watson_sparse(x, y, x_eval, h)
t_sparse = time.time() - t0

print("Performance Comparison")
print("=" * 50)
print(f"n = {n:,}, m = {m:,}, h = {h}")
print(f"Vectorized: {t_vec:.4f}s")
print(f"Sparse:     {t_sparse:.4f}s")
print(f"Speedup:    {t_vec / t_sparse:.1f}x")
print(f"Max diff:   {np.max(np.abs(y_vec - y_sparse)):.6f}")
```

In practice, use well-tested libraries: statsmodels.nonparametric.lowess (Python), scipy.ndimage.gaussian_filter1d (for Gaussian kernels), or sklearn.neighbors.KNeighborsRegressor (k-NN regression). These handle edge cases and optimize for performance.
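Strategy 3 (binning and interpolation) is not covered by the listing above, so here is a minimal, self-contained sketch of the bin-then-smooth idea. The uniform grid, the Epanechnikov kernel, the bin count, and the helper name `nadaraya_watson_binned` are all illustrative assumptions, and the result is only an approximation to the exact fit.

```python
import numpy as np

def nadaraya_watson_binned(x, y, x_eval, h, n_bins=400):
    """Approximate NW regression: bin the data, kernel-smooth the bin
    counts and bin sums, then interpolate to x_eval. Sketch only."""
    x, y, x_eval = map(np.asarray, (x, y, x_eval))

    # 1. Bin the data onto a uniform grid of bin centers
    edges = np.linspace(x.min(), x.max(), n_bins + 1)
    centers = 0.5 * (edges[:-1] + edges[1:])
    idx = np.clip(np.digitize(x, edges) - 1, 0, n_bins - 1)
    counts = np.bincount(idx, minlength=n_bins)            # points per bin
    sums = np.bincount(idx, weights=y, minlength=n_bins)   # sum of y per bin

    # 2. Kernel-smooth the binned sums and counts (Epanechnikov kernel)
    u = (centers[:, None] - centers[None, :]) / h          # (B, B), B << n
    K = 0.75 * (1 - u**2) * (np.abs(u) < 1)
    fit_at_centers = (K @ sums) / np.maximum(K @ counts, 1e-10)

    # 3. Interpolate from the grid to the evaluation points
    return np.interp(x_eval, centers, fit_at_centers)

# Compare against an exact (broadcasted) NW fit on the same synthetic data
rng = np.random.default_rng(42)
x = np.sort(rng.uniform(0, 100, 10_000))
y = np.sin(x / 10) + rng.normal(0, 0.3, 10_000)
x_eval = np.linspace(0, 100, 1_000)
h = 2.0

u = (x[:, None] - x_eval[None, :]) / h
K = 0.75 * (1 - u**2) * (np.abs(u) < 1)
exact = (K.T @ y) / np.maximum(K.sum(axis=0), 1e-10)

approx = nadaraya_watson_binned(x, y, x_eval, h)
print("Max |binned - exact|:", np.max(np.abs(approx - exact)))
```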
Conceptual Relationship:

Kernel smoothing (Nadaraya-Watson) and LOESS are closely related:
- NW estimator = LOESS with degree 0 and fixed bandwidth
- Local linear = LOESS with degree 1
- LOESS typically uses adaptive (nearest-neighbor) bandwidth
- Kernel smoothing typically uses fixed bandwidth

The key practical differences are:

1. Bandwidth adaptation: LOESS adapts to local data density; kernel smoothing uses fixed $h$

2. Boundary behavior: Local linear (LOESS default) corrects boundary bias; NW does not

3. Computation: NW is simpler and faster; local polynomial adds matrix inversions

4. Robustness: LOESS has built-in robustifying iterations; standard kernel smoothing does not
| Aspect | Kernel Smoothing (NW) | LOESS (Local Linear) |
|---|---|---|
| Local model | Constant (weighted average) | Linear (or polynomial) |
| Bandwidth | Fixed $h$ | Adaptive (fraction $\alpha$ of data) |
| Boundary bias | Yes (systematic) | No (automatically corrected) |
| Design bias | Yes (problematic) | No (corrected) |
| Computation | $O(nm)$ simple | $O(nm p^3)$ matrix ops |
| Theory | Cleaner asymptotics | More complex |
| Derivatives | Not directly available | Local slope available |
| Outlier handling | None built-in | Robustifying iterations |
| Common use | Theory, quick exploration | Publication-quality smoothing |
For most practical applications, use LOESS with local linear fits (degree 1). The boundary and design bias corrections are worth the modest computational overhead. Reserve pure kernel smoothing (NW) for theoretical work, speed-critical applications with uniform designs, or as a building block for more complex methods.
Extension to Multiple Predictors:

For $\mathbf{x} = (x_1, \ldots, x_d) \in \mathbb{R}^d$, the multivariate NW estimator is:

$$\hat{f}(\mathbf{x}_0) = \frac{\sum_{i=1}^{n} K_\mathbf{H}(\mathbf{x}_i - \mathbf{x}_0) y_i}{\sum_{i=1}^{n} K_\mathbf{H}(\mathbf{x}_i - \mathbf{x}_0)}$$

where $\mathbf{H}$ is a $d \times d$ bandwidth matrix and:

$$K_\mathbf{H}(\mathbf{u}) = |\mathbf{H}|^{-1/2} K(\mathbf{H}^{-1/2} \mathbf{u})$$

Product kernel (equal bandwidth in every coordinate):

The simplest choice is a product of univariate kernels with a single bandwidth:
$$K_\mathbf{H}(\mathbf{u}) = \prod_{j=1}^{d} \frac{1}{h} K\left( \frac{u_j}{h} \right)$$

Bandwidth selection in high dimensions:

- Optimal bandwidth: $h_{\text{opt}} = O(n^{-1/(4+d)})$
- MISE: $O(n^{-4/(4+d)})$

As $d$ increases, both the optimal bandwidth and the convergence rate deteriorate dramatically—this is the curse of dimensionality, which we'll explore in detail later.
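A minimal sketch of the product-kernel estimator in $d = 2$ follows. The Gaussian kernel, the single shared bandwidth, the synthetic data, and the helper name `nw_multivariate` are illustrative assumptions.

```python
import numpy as np

def nw_multivariate(X, y, X_eval, h):
    """Multivariate Nadaraya-Watson with a product Gaussian kernel
    and a single bandwidth h for every coordinate (sketch)."""
    X = np.asarray(X, dtype=float)            # (n, d)
    X_eval = np.asarray(X_eval, dtype=float)  # (m, d)

    # Pairwise scaled differences: (m, n, d)
    U = (X_eval[:, None, :] - X[None, :, :]) / h

    # Product kernel: multiply univariate Gaussian kernels over coordinates
    K = np.prod(np.exp(-U**2 / 2) / (np.sqrt(2 * np.pi) * h), axis=2)  # (m, n)

    return (K @ y) / np.maximum(K.sum(axis=1), 1e-10)

# Toy example in d = 2
rng = np.random.default_rng(1)
X = rng.uniform(-2, 2, size=(500, 2))
y = np.sin(X[:, 0]) + X[:, 1]**2 + rng.normal(0, 0.2, 500)

X_eval = np.array([[0.0, 0.0], [1.0, -1.0]])
print(nw_multivariate(X, y, X_eval, h=0.5))
```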
In $d = 10$ dimensions, the MISE rate is only $O(n^{-4/14}) \approx O(n^{-0.29})$. To halve the error, you need $2^{1/0.29} \approx 11$ times more data! This fundamental limitation makes nonparametric regression impractical for high-dimensional problems without additional structure.
We've developed a comprehensive understanding of kernel smoothing as the foundation of nonparametric regression. Let's consolidate the key concepts:
What's Next:

We've seen that bandwidth is the critical hyperparameter. But how do we choose it in practice? The next page develops principled approaches to bandwidth selection: cross-validation, plug-in methods, and rule-of-thumb estimators.
You now understand kernel smoothing from theory to implementation. The Nadaraya-Watson estimator and its local polynomial extensions form the foundation of nonparametric regression. Next, we tackle the practical challenge of selecting the bandwidth parameter.