Standard KNN (even with sophisticated weighting) makes a strong assumption: within each neighborhood, the target function is approximately constant. The prediction is a weighted average—essentially assuming the function is flat locally.
But what if the function has a clear local trend? A linear slope within the neighborhood? Even curvature?
Consider predicting house prices: within a neighborhood of similar homes, prices might increase with square footage at a consistent rate. A weighted average ignores this trend, introducing bias. A local linear model captures the trend, dramatically improving predictions.
This insight leads to local polynomial regression—fitting parametric models (constant, linear, quadratic) within each neighborhood using weighted least squares. The result combines the flexibility of nonparametric methods with the efficiency of parametric modeling.
By the end of this page, you will understand local constant models (Nadaraya-Watson as a special case), local linear regression and its bias-reduction properties, the famous LOESS/LOWESS algorithm, local polynomial regression of arbitrary degree, and the tradeoffs between local model complexity and variance.
Let's formalize what standard weighted KNN regression actually computes.
The Local Constant Model:
For a query point $\mathbf{x}$, we fit a constant model $\hat{f}(\mathbf{x}) = \beta_0$ by minimizing weighted squared error:
$$\hat{\beta}_0 = \arg\min_{\beta_0} \sum_{i=1}^{n} w_i(\mathbf{x}) \cdot (y_i - \beta_0)^2$$
where $w_i(\mathbf{x}) = K_h(\mathbf{x} - \mathbf{x}_i)$ is the kernel weight of training point $i$.
Solution:
Taking the derivative and setting to zero:
$$\hat{\beta}_0 = \frac{\sum_i w_i(\mathbf{x}) \cdot y_i}{\sum_i w_i(\mathbf{x})}$$
This is exactly the Nadaraya-Watson estimator! Weighted KNN regression is equivalent to fitting a local constant model.
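The closed-form solution is simple enough to compute directly. Here is a minimal sketch (assuming a Gaussian kernel and Euclidean distances; the function name is illustrative):

```python
import numpy as np

def nadaraya_watson(X_train, y_train, x_query, h):
    """Local constant fit: a kernel-weighted average of the targets."""
    distances = np.linalg.norm(X_train - x_query, axis=1)
    weights = np.exp(-0.5 * (distances / h) ** 2)  # Gaussian kernel weights w_i(x)
    return np.sum(weights * y_train) / np.sum(weights)

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.0, 1.0, 2.0, 3.0])
# Data is symmetric around the query, so the weighted average is 1.5
pred = nadaraya_watson(X, y, np.array([1.5]), h=1.0)
```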
Geometric Interpretation:
We're fitting a flat surface (a degree-0 polynomial) to the weighted data, and using its height as our prediction.
Local constant models suffer bias driven by the first-order terms in the Taylor expansion of the true function: if f(x) has a slope at the query point, averaging nearby points pulls the estimate toward the side with more data (typically toward the interior of the distribution). In the interior this contributes an O(h²) bias term that depends on both the slope and the design density; at boundaries, where the neighborhood is one-sided, the bias degrades to O(h).
Where Local Constant Fails:
These problems motivate fitting richer local models that can capture trends rather than just location.
Local linear regression fits a linear model (intercept + slopes) at each query point using weighted least squares.
The Local Linear Model:
For query $\mathbf{x}$, we fit: $$\hat{f}(\mathbf{z}) = \beta_0 + \boldsymbol{\beta}_1^T (\mathbf{z} - \mathbf{x})$$
by minimizing: $$\sum_{i=1}^{n} w_i(\mathbf{x}) \cdot \left(y_i - \beta_0 - \boldsymbol{\beta}_1^T (\mathbf{x}_i - \mathbf{x})\right)^2$$
The prediction at $\mathbf{x}$ is $\hat{f}(\mathbf{x}) = \hat{\beta}_0$ (since $(\mathbf{x} - \mathbf{x}) = \mathbf{0}$).
Key Insight: We fit a local hyperplane but only use the intercept for prediction. The slopes are estimated but discarded—their role is to remove bias from the intercept estimate.
Weighted Least Squares Solution:
Define the design matrix and weights: $$\mathbf{X} = \begin{bmatrix} 1 & (\mathbf{x}_1 - \mathbf{x})^T \\ \vdots & \vdots \\ 1 & (\mathbf{x}_n - \mathbf{x})^T \end{bmatrix}, \quad \mathbf{W} = \text{diag}(w_1(\mathbf{x}), ..., w_n(\mathbf{x}))$$
The solution is: $$\hat{\boldsymbol{\beta}} = (\mathbf{X}^T \mathbf{W} \mathbf{X})^{-1} \mathbf{X}^T \mathbf{W} \mathbf{y}$$
Prediction: $\hat{f}(\mathbf{x}) = \hat{\beta}_0 = [\hat{\boldsymbol{\beta}}]_0$
```python
import numpy as np
from typing import Tuple


class LocalLinearRegression:
    """
    Local linear regression using weighted least squares.

    At each query point, fits a linear model to nearby points using
    kernel weights, then uses the intercept as the prediction.
    """

    def __init__(self, bandwidth: float = None, k: int = None):
        """
        Parameters
        ----------
        bandwidth : fixed bandwidth h for kernel weights
        k : if set, use k-NN adaptive bandwidth (overrides bandwidth)
        """
        if bandwidth is None and k is None:
            raise ValueError("Must specify either bandwidth or k")
        self.bandwidth = bandwidth
        self.k = k
        self.X_train = None
        self.y_train = None

    def fit(self, X: np.ndarray, y: np.ndarray):
        """Store training data."""
        self.X_train = np.array(X)
        self.y_train = np.array(y)
        return self

    def _get_weights(self, x_query: np.ndarray) -> Tuple[np.ndarray, float]:
        """Compute kernel weights for a query point."""
        distances = np.linalg.norm(self.X_train - x_query, axis=1)

        # Determine bandwidth
        if self.k is not None:
            k_idx = np.argsort(distances)[:self.k]
            h = distances[k_idx[-1]] + 1e-10
        else:
            h = self.bandwidth

        # Tricube kernel (common in LOESS)
        u = distances / h
        weights = np.where(u <= 1, (1 - u**3)**3, 0.0)
        return weights, h

    def _predict_one(self, x_query: np.ndarray) -> Tuple[float, dict]:
        """Predict for a single query point using local linear model."""
        n, d = self.X_train.shape
        weights, h = self._get_weights(x_query)

        # Filter to points with non-zero weight for efficiency
        active = weights > 1e-10
        if active.sum() < d + 1:
            # Not enough points for linear fit, fall back to weighted average
            if active.sum() == 0:
                return self.y_train.mean(), {'method': 'global_mean'}
            else:
                pred = np.sum(weights[active] * self.y_train[active]) / weights[active].sum()
                return pred, {'method': 'weighted_average'}

        X_active = self.X_train[active]
        y_active = self.y_train[active]
        w_active = weights[active]

        # Build design matrix: [1, (x_i - x_query)]
        X_centered = X_active - x_query
        design = np.column_stack([np.ones(len(X_active)), X_centered])

        # Weighted least squares: (X^T W X)^{-1} X^T W y
        W = np.diag(w_active)
        XtWX = design.T @ W @ design
        XtWy = design.T @ W @ y_active

        # Regularize for numerical stability
        XtWX += 1e-8 * np.eye(d + 1)

        try:
            beta = np.linalg.solve(XtWX, XtWy)
            prediction = beta[0]  # Intercept is our prediction
            metadata = {
                'method': 'local_linear',
                'bandwidth': h,
                'n_active': active.sum(),
                'slope': beta[1:],  # The estimated local slopes
            }
        except np.linalg.LinAlgError:
            # Singular matrix - fall back to weighted average
            prediction = np.sum(w_active * y_active) / w_active.sum()
            metadata = {'method': 'fallback_weighted_average'}

        return prediction, metadata

    def predict(self, X: np.ndarray, return_slopes: bool = False):
        """Predict for all query points."""
        X = np.atleast_2d(X)
        predictions = []
        all_meta = []
        for x in X:
            pred, meta = self._predict_one(x)
            predictions.append(pred)
            all_meta.append(meta)
        if return_slopes:
            return np.array(predictions), all_meta
        return np.array(predictions)


# Demonstration: Compare local constant vs local linear
np.random.seed(42)

# Generate data with clear linear trend
X_train = np.random.uniform(0, 10, (100, 1))
true_func = lambda x: 2 * x + np.sin(x)
y_train = true_func(X_train.flatten()) + 0.5 * np.random.randn(100)

# Query at boundary (where local constant is biased)
X_query = np.array([[0.5], [9.5], [5.0]])

print("Local Constant vs Local Linear Regression:")
print("=" * 60)

# Local constant (Nadaraya-Watson)
def local_constant(X_train, y_train, x_query, k):
    distances = np.linalg.norm(X_train - x_query, axis=1)
    k_idx = np.argsort(distances)[:k]
    h = distances[k_idx[-1]] + 1e-10
    u = distances[k_idx] / h
    weights = (1 - u**3)**3  # Tricube
    return np.sum(weights * y_train[k_idx]) / weights.sum()

# Local linear
model_linear = LocalLinearRegression(k=20)
model_linear.fit(X_train, y_train)

for x in X_query:
    truth = true_func(x[0])
    pred_const = local_constant(X_train, y_train, x, k=20)
    pred_linear, meta = model_linear.predict(x.reshape(1, -1), return_slopes=True)
    print(f"Query x = {x[0]:.1f}:")
    print(f"  True value:     {truth:.4f}")
    print(f"  Local constant: {pred_const:.4f} (error: {abs(pred_const - truth):.4f})")
    print(f"  Local linear:   {pred_linear[0]:.4f} (error: {abs(pred_linear[0] - truth):.4f})")
    if 'slope' in meta[0]:
        print(f"  Estimated slope: {meta[0]['slope'][0]:.4f} (true: ~2.0)")
```

Local linear regression automatically corrects for first-order bias. Its variance is of the same asymptotic order as the local constant estimator's, but its O(h²) bias comes from second-order terms only (the first-order terms cancel). At boundaries this is particularly dramatic: local linear doesn't suffer the boundary bias that plagues local constant.
LOESS (Locally Estimated Scatterplot Smoothing) and LOWESS (Locally Weighted Scatterplot Smoothing) are the most widely used local polynomial methods in practice, developed by Cleveland (1979) and Cleveland & Devlin (1988).
Key Features of LOESS:
Tricube kernel: Uses the tricube weight function: $$w(u) = (1 - |u|^3)^3 \text{ for } |u| \leq 1$$ This is very smooth at boundaries (zero derivative at $u = \pm 1$).
Adaptive bandwidth: Uses fraction of data ($\alpha$) rather than fixed k or h: $$k = \lceil \alpha \cdot n \rceil$$ Typical values: $\alpha \in [0.2, 0.8]$
Optional polynomial degree: Usually degree 1 (linear) or 2 (quadratic)
Robustness iteration: Optionally re-weights to downweight outliers
The LOWESS Algorithm:
for each query point x:
1. Find k = ⌈α·n⌉ nearest neighbors
2. Compute bandwidth h = distance to k-th neighbor
3. Compute tricube weights w_i = (1 - (d_i/h)³)³
4. Fit weighted polynomial (degree 1 or 2)
5. Predict using fitted polynomial at x
```python
import numpy as np


class LOESS:
    """
    LOESS (Locally Estimated Scatterplot Smoothing).

    Implements Cleveland's classic algorithm with:
    - Tricube weighting
    - Fraction-based neighborhood (span parameter)
    - Optional robustness iterations
    - Polynomial degree 1 or 2
    """

    def __init__(self, span: float = 0.3, degree: int = 1, robust_iters: int = 0):
        """
        Parameters
        ----------
        span : fraction of data to use in each local fit (α)
        degree : polynomial degree (1 = linear, 2 = quadratic)
        robust_iters : number of robustness iterations (0 = none)
        """
        if not 0 < span <= 1:
            raise ValueError("span must be in (0, 1]")
        if degree not in [1, 2]:
            raise ValueError("degree must be 1 or 2")
        self.span = span
        self.degree = degree
        self.robust_iters = robust_iters
        self.X_train = None
        self.y_train = None
        self.robustness_weights = None

    def _tricube(self, u: np.ndarray) -> np.ndarray:
        """Tricube weight function."""
        return np.where(u < 1, (1 - u**3)**3, 0.0)

    def _bisquare(self, u: np.ndarray) -> np.ndarray:
        """Bisquare weight for robustness."""
        return np.where(u < 1, (1 - u**2)**2, 0.0)

    def fit(self, X: np.ndarray, y: np.ndarray):
        """Fit LOESS model (store data and optionally compute robustness weights)."""
        self.X_train = np.array(X).reshape(-1, 1) if X.ndim == 1 else np.array(X)
        self.y_train = np.array(y)
        n = len(self.y_train)
        self.robustness_weights = np.ones(n)

        if self.robust_iters > 0:
            # Perform robustness iterations
            for iteration in range(self.robust_iters):
                # Get fitted values at training points
                fitted = self._predict_many(self.X_train, self.robustness_weights)
                residuals = self.y_train - fitted

                # Median absolute deviation
                mad = np.median(np.abs(residuals))
                if mad < 1e-10:
                    break

                # Normalized residuals
                u = np.abs(residuals) / (6 * mad)

                # Update robustness weights using bisquare
                self.robustness_weights = self._bisquare(u)

        return self

    def _predict_one(self, x_query: np.ndarray, robustness_weights: np.ndarray) -> float:
        """Predict for a single point."""
        n, d = self.X_train.shape
        k = max(int(np.ceil(self.span * n)), d + 2)  # At least enough for fit

        # Distances
        distances = np.linalg.norm(self.X_train - x_query, axis=1)

        # Find k nearest
        k_idx = np.argsort(distances)[:k]
        h = distances[k_idx[-1]] + 1e-10

        # Tricube weights
        u = distances[k_idx] / h
        weights = self._tricube(u) * robustness_weights[k_idx]

        # Build design matrix
        X_local = self.X_train[k_idx]
        y_local = self.y_train[k_idx]
        X_centered = X_local - x_query

        if self.degree == 1:
            design = np.column_stack([np.ones(k), X_centered])
        else:  # degree == 2
            # Add quadratic terms
            if d == 1:
                design = np.column_stack([np.ones(k), X_centered, X_centered**2])
            else:
                # For multivariate: include all quadratic terms
                quad_terms = []
                for i in range(d):
                    for j in range(i, d):
                        quad_terms.append(X_centered[:, i] * X_centered[:, j])
                design = np.column_stack([np.ones(k), X_centered,
                                          np.column_stack(quad_terms)])

        # Weighted least squares
        W = np.diag(weights)
        XtWX = design.T @ W @ design + 1e-8 * np.eye(design.shape[1])
        XtWy = design.T @ W @ y_local

        try:
            beta = np.linalg.solve(XtWX, XtWy)
            return beta[0]  # Intercept at query point
        except np.linalg.LinAlgError:
            return np.sum(weights * y_local) / (weights.sum() + 1e-10)

    def _predict_many(self, X: np.ndarray, robustness_weights: np.ndarray) -> np.ndarray:
        """Predict for multiple points."""
        X = np.atleast_2d(X)
        return np.array([self._predict_one(x, robustness_weights) for x in X])

    def predict(self, X: np.ndarray) -> np.ndarray:
        """Predict for query points using fitted model."""
        X = np.atleast_2d(X)
        return self._predict_many(X, self.robustness_weights)


# Demonstration: LOESS with different spans and degrees
np.random.seed(42)

# Generate noisy data with nonlinear pattern
X_train = np.linspace(0, 2*np.pi, 50).reshape(-1, 1)
y_train = np.sin(X_train.flatten()) + 0.3 * np.random.randn(50)

# Add outliers
y_train[10] += 3  # Positive outlier
y_train[40] -= 3  # Negative outlier

X_query = np.linspace(0, 2*np.pi, 100).reshape(-1, 1)
y_true = np.sin(X_query.flatten())

print("LOESS Comparison:")
print("=" * 60)

configs = [
    {'span': 0.2, 'degree': 1, 'robust_iters': 0},
    {'span': 0.4, 'degree': 1, 'robust_iters': 0},
    {'span': 0.4, 'degree': 2, 'robust_iters': 0},
    {'span': 0.4, 'degree': 1, 'robust_iters': 3},
]

for cfg in configs:
    model = LOESS(**cfg)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_query)
    mse = np.mean((y_pred - y_true)**2)
    print(f"span={cfg['span']}, degree={cfg['degree']}, robust_iters={cfg['robust_iters']}")
    print(f"  MSE: {mse:.4f}")
```

The span parameter controls the bias-variance tradeoff just like k or h. Use smaller spans (0.1-0.3) for capturing fine detail, larger spans (0.5-0.8) for overall trends. Cross-validation can select the optimal span. For data exploration, try multiple spans to understand the signal at different scales.
Local polynomial regression generalizes to degree $p$ polynomials, fitting:
$$\hat{f}(\mathbf{z}) = \sum_{|\alpha| \leq p} \beta_\alpha (\mathbf{z} - \mathbf{x})^\alpha$$
at each query point $\mathbf{x}$ using weighted least squares.
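In 1D, a degree-$p$ local fit can be sketched with NumPy's weighted `polyfit`. Note that `np.polyfit` minimizes the sum of squared *weighted* residuals $\sum_i (w_i r_i)^2$, so we pass the square root of the kernel weights; the function name and fallback are illustrative:

```python
import numpy as np

def local_poly_predict(x_train, y_train, x_query, h, degree):
    """Degree-p local polynomial prediction at a single 1-D query point."""
    u = np.abs(x_train - x_query) / h
    weights = np.where(u < 1, (1 - u**3)**3, 0.0)  # tricube kernel
    mask = weights > 0
    if mask.sum() < degree + 1:
        # Not enough points in the window: fall back to a weighted average
        return np.average(y_train, weights=weights + 1e-12)
    # polyfit minimizes sum((w_i * residual_i)^2), so pass sqrt of kernel weights
    coeffs = np.polyfit(x_train[mask] - x_query, y_train[mask],
                        deg=degree, w=np.sqrt(weights[mask]))
    return np.polyval(coeffs, 0.0)  # intercept = prediction at x_query

x = np.linspace(0, 10, 200)
y = 2 * x  # noiseless linear function: degree-1 fit recovers it exactly
pred = local_poly_predict(x, y, x_query=5.0, h=2.0, degree=1)  # 10.0 up to float error
```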
Bias-Variance Tradeoffs Across Degrees:
| Degree $p$ | Bias Order | Variance | Boundary Bias |
|---|---|---|---|
| 0 (constant) | $O(h^2)$ | Lowest | Severe |
| 1 (linear) | $O(h^2)$ | Low | Eliminated |
| 2 (quadratic) | $O(h^4)$ | Moderate | Eliminated |
| 3 (cubic) | $O(h^4)$ | Higher | Eliminated |
Key Theoretical Results:
Odd vs Even Degrees: Local polynomial estimators of degree $p$ have bias of order $O(h^{p+1})$ when $p$ is odd, and $O(h^{p+2})$ when $p$ is even (an extra cancellation occurs in the interior). This gives odd degrees an advantage: degree $p+1$ (odd) attains the same bias order as degree $p$ (even) without increasing the order of the variance, and does so at boundaries as well.
Design Adaptive: Local polynomial estimators automatically adapt to design (clustering of X values) without requiring explicit handling.
Boundary Correction: Polynomials of degree ≥ 1 automatically correct boundary bias without needing special boundary kernels.
Minimax Optimality: Local linear estimators are asymptotically minimax optimal in a certain function class.
Beyond degree 2, the variance increase typically outweighs bias reduction benefits. Local cubic or higher fits become numerically unstable and overfit the local data. In practice, degree 1 (linear) is the sweet spot for most applications—degree 2 is used only when you have evidence of significant local curvature and sufficient data.
Equivalent Kernels:
A remarkable result: every local polynomial estimator can be expressed as a linear smoother with an effective kernel:
$$\hat{f}(\mathbf{x}) = \sum_{i=1}^{n} K^*_i(\mathbf{x}) \cdot y_i$$
where $K^*_i(\mathbf{x})$ is the equivalent kernel weight (different from the original kernel $K$).
For local linear regression: $$K^*_i(\mathbf{x}) = \frac{K_h(\mathbf{x} - \mathbf{x}_i)}{\sum_j K_h(\mathbf{x} - \mathbf{x}_j)} \cdot \left(1 + \frac{(\mathbf{x} - \bar{\mathbf{x}}_w)^T (\mathbf{x}_i - \bar{\mathbf{x}}_w)}{\text{var}_w(\mathbf{X})}\right)$$
where $\bar{\mathbf{x}}_w$ and $\text{var}_w$ are weighted mean and variance.
Interpretation: Local linear regression implicitly constructs an asymmetric equivalent kernel that tilts toward the direction of the query point, correcting for bias.
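This identity is easy to check numerically: the first row of the local linear smoother matrix, $\mathbf{e}_1^T (\mathbf{X}^T \mathbf{W} \mathbf{X})^{-1} \mathbf{X}^T \mathbf{W}$, gives the equivalent kernel weights, which sum to one and reproduce linear functions exactly. A 1-D sketch, assuming a Gaussian kernel (names illustrative):

```python
import numpy as np

def equivalent_kernel_weights(x_train, x_query, h):
    """Effective weights K*_i such that the local linear fit is sum_i K*_i y_i."""
    w = np.exp(-0.5 * ((x_train - x_query) / h) ** 2)  # Gaussian kernel weights
    design = np.column_stack([np.ones_like(x_train), x_train - x_query])
    XtW = design.T * w  # broadcast: scale each column of X^T by w_j
    # First row of the smoother matrix: e_1^T (X^T W X)^{-1} X^T W
    return np.linalg.solve(XtW @ design, XtW)[0]

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 10.0, 50)
k_star = equivalent_kernel_weights(x, x_query=2.0, h=1.0)

# K* reproduces constants (weights sum to 1) and exact linear trends:
pred = k_star @ (3 * x + 1)  # equals 3*2 + 1 = 7 up to floating point
```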
Local polynomial regression is computationally more demanding than simple weighted averages. Let's analyze the costs.
Per-Query Computation:

Let $q$ be the number of local polynomial coefficients ($q = d + 1$ for degree 1; $q = O(d^2)$ for degree 2). Forming the weighted normal equations costs $O(k q^2)$ and solving them costs $O(q^3)$:

- Degree 1 in 1D: $O(k)$ per query
- Degree 1 in $d$ dimensions: $O(k d^2 + d^3)$ per query
- Degree 2 in $d$ dimensions: $O(k d^4 + d^6)$ per query

Total for $m$ queries: $O(m \cdot (k q^2 + q^3))$, assuming the $k$ neighbors are pre-found.
Comparison to Simple Weighted KNN:
- Simple weighted KNN: $O(m \cdot k)$
- Local linear: $O(m \cdot k)$ in low dimensions (the linear system is small)
- Local quadratic: $O(m \cdot (k d^4 + d^6))$
The overhead is minimal for low-dimensional data but grows rapidly with dimension.
```python
import numpy as np
from scipy.spatial import KDTree
import time


class EfficientLocalLinear:
    """
    Efficient local linear regression using KD-tree for neighbor lookup.
    """

    def __init__(self, k: int = 20):
        self.k = k
        self.tree = None
        self.X_train = None
        self.y_train = None

    def fit(self, X: np.ndarray, y: np.ndarray):
        """Build KD-tree index."""
        self.X_train = np.array(X)
        self.y_train = np.array(y)
        self.tree = KDTree(self.X_train)
        return self

    def predict(self, X: np.ndarray) -> np.ndarray:
        """Predict using KD-tree accelerated local linear regression."""
        X = np.atleast_2d(X)
        predictions = np.zeros(len(X))

        # Batch query for all nearest neighbors at once
        distances, indices = self.tree.query(X, k=self.k)

        for i, (x_query, dists, idx) in enumerate(zip(X, distances, indices)):
            # Tricube weights
            h = dists[-1] + 1e-10
            u = dists / h
            weights = np.where(u < 1, (1 - u**3)**3, 0.0)

            # Get local data
            X_local = self.X_train[idx]
            y_local = self.y_train[idx]

            # Build design matrix
            X_centered = X_local - x_query
            n_local, d = X_centered.shape
            design = np.column_stack([np.ones(n_local), X_centered])

            # Weighted least squares (optimized for small system)
            W_diag = weights
            XtW = design.T * W_diag  # Broadcast multiplication
            XtWX = XtW @ design
            XtWy = XtW @ y_local

            # Solve (with regularization)
            XtWX += 1e-8 * np.eye(d + 1)
            try:
                beta = np.linalg.solve(XtWX, XtWy)
                predictions[i] = beta[0]
            except np.linalg.LinAlgError:
                predictions[i] = np.sum(weights * y_local) / (weights.sum() + 1e-10)

        return predictions


# Benchmark: efficient vs naive implementation
np.random.seed(42)

for n_train in [1000, 5000, 10000]:
    for d in [2, 5]:
        X_train = np.random.randn(n_train, d)
        y_train = X_train.sum(axis=1) + 0.5 * np.random.randn(n_train)
        X_test = np.random.randn(100, d)

        model = EfficientLocalLinear(k=30)
        model.fit(X_train, y_train)

        start = time.time()
        predictions = model.predict(X_test)
        elapsed = time.time() - start

        print(f"n={n_train}, d={d}: {elapsed*1000:.2f} ms for 100 predictions")
```

Local polynomial regression connects to numerous other methods in machine learning and statistics.
Connection to Splines:
Smoothing splines can be viewed as a global version of local polynomial regression: while local methods fit polynomials in sliding windows, splines fit piecewise polynomials with continuity constraints.
Connection to Gaussian Processes:
Gaussian Process (GP) regression with a polynomial mean function and local covariance structure resembles local polynomial regression.
Connection to Random Forests:
Random forest predictions are local averages over tree partitions.
| Method | Local Model | Neighborhood | Strengths |
|---|---|---|---|
| Nadaraya-Watson | Constant | Kernel bandwidth | Simple, low variance |
| Local Linear | Linear | Kernel bandwidth | Boundary correction, low bias |
| LOESS | Linear/Quadratic | Span-based | Robust, well-studied |
| Smoothing Splines | Cubic pieces | Global with local influence | Smooth, automatic |
| Gaussian Process | Infinite-dimensional | Covariance-based | Uncertainty quantification |
| Random Forest | Constant in leaves | Tree partitions | Feature selection, scalable |
Use local polynomial regression when: (1) you need interpretable local structure (slopes, curvature), (2) boundary behavior matters, (3) the data is small-to-medium sized (thousands to hundreds of thousands), (4) you want direct control over smoothness through span/bandwidth.
Local polynomial ideas extend to classification through local logistic regression or locally weighted learning.
Local Logistic Regression:
For binary classification, fit a logistic model locally:
$$\hat{p}_i = \hat{P}(Y = 1 \mid \mathbf{x}_i) = \sigma\left(\beta_0 + \boldsymbol{\beta}_1^T (\mathbf{x}_i - \mathbf{x})\right)$$
where $\sigma$ is the sigmoid function.
Minimize weighted negative log-likelihood: $$\sum_i w_i(\mathbf{x}) \cdot \left[-y_i \log \hat{p}_i - (1-y_i) \log(1-\hat{p}_i)\right]$$
Predict: $\hat{P}(Y=1 | \mathbf{x}) = \sigma(\hat{\beta}_0)$
Benefits of Local Logistic:
Implementation Complexity:
Local logistic regression requires iterative optimization (Newton-Raphson or gradient descent) at each query point, making it significantly more expensive than local linear regression. In practice, it is often approximated by local linear regression on log-transformed probabilities.
```python
import numpy as np
from scipy.optimize import minimize


class LocalLogisticClassifier:
    """
    Local logistic regression classifier.

    Fits a weighted logistic regression at each query point.
    """

    def __init__(self, k: int = 30):
        self.k = k
        self.X_train = None
        self.y_train = None

    def fit(self, X: np.ndarray, y: np.ndarray):
        """Store training data."""
        self.X_train = np.array(X)
        self.y_train = np.array(y).astype(float)
        return self

    def _sigmoid(self, z):
        return 1 / (1 + np.exp(-np.clip(z, -500, 500)))

    def _predict_one(self, x_query: np.ndarray) -> float:
        """Predict probability for one query point."""
        n, d = self.X_train.shape

        # Find k neighbors
        distances = np.linalg.norm(self.X_train - x_query, axis=1)
        k_idx = np.argsort(distances)[:self.k]
        h = distances[k_idx[-1]] + 1e-10

        # Tricube weights
        u = distances[k_idx] / h
        weights = np.where(u < 1, (1 - u**3)**3, 0.0)

        X_local = self.X_train[k_idx]
        y_local = self.y_train[k_idx]

        # Centered design matrix
        X_centered = X_local - x_query

        # Weighted negative log-likelihood
        def neg_log_likelihood(beta):
            linear = beta[0] + X_centered @ beta[1:]
            p = self._sigmoid(linear)
            p = np.clip(p, 1e-10, 1 - 1e-10)
            ll = weights @ (y_local * np.log(p) + (1 - y_local) * np.log(1 - p))
            # Add small L2 regularization for stability
            reg = 0.01 * np.sum(beta[1:]**2)
            return -ll + reg

        # Optimize
        beta_init = np.zeros(d + 1)
        result = minimize(neg_log_likelihood, beta_init, method='BFGS')

        # Prediction at query point (where x - x_query = 0)
        prob = self._sigmoid(result.x[0])
        return prob

    def predict_proba(self, X: np.ndarray) -> np.ndarray:
        """Predict probabilities."""
        X = np.atleast_2d(X)
        probs = np.array([self._predict_one(x) for x in X])
        return np.column_stack([1 - probs, probs])

    def predict(self, X: np.ndarray) -> np.ndarray:
        """Predict class labels."""
        proba = self.predict_proba(X)
        return (proba[:, 1] > 0.5).astype(int)


# Demonstration: XOR-like pattern
np.random.seed(42)

# Generate XOR pattern
n = 200
X_train = np.random.uniform(-2, 2, (n, 2))
y_train = ((X_train[:, 0] * X_train[:, 1]) > 0).astype(int)

# Add noise
flip_idx = np.random.choice(n, size=20, replace=False)
y_train[flip_idx] = 1 - y_train[flip_idx]

# Test points
X_test = np.array([
    [0.5, 0.5],    # Class 1 region
    [-0.5, -0.5],  # Class 1 region
    [0.5, -0.5],   # Class 0 region
    [0, 0],        # Boundary
])

model = LocalLogisticClassifier(k=30)
model.fit(X_train, y_train)
probas = model.predict_proba(X_test)

print("Local Logistic Classification (XOR pattern):")
print("=" * 50)
for x, p in zip(X_test, probas):
    print(f"x = {x}: P(class=0) = {p[0]:.3f}, P(class=1) = {p[1]:.3f}")
```

You now understand how local models elevate KNN from simple averaging to principled local modeling with bias-variance control. The final page explores Regression with KNN—examining how all these techniques come together for regression tasks and comparing them to other regression approaches.