Throughout our journey in regression, we've fit global models—functions defined by a single set of parameters that apply uniformly across the entire input domain. Linear regression uses one slope and intercept everywhere. Even polynomial regression, despite its flexibility, commits to a single polynomial that governs all predictions.
But what if the relationship between variables changes fundamentally across different regions of the input space? What if the slope is steep for small values of $x$ but nearly flat for large values? What if there are local patterns that no single polynomial can capture?
Local regression addresses these challenges by fitting separate models in different neighborhoods of the input space. Rather than asking 'What single function best describes all the data?', it asks 'What function best describes the data near this specific point?'
This deceptively simple shift in perspective unlocks remarkable flexibility—and introduces fascinating new challenges in bias-variance tradeoffs, computational complexity, and the curse of dimensionality.
By the end of this page, you will understand: (1) The fundamental philosophy of local regression; (2) The mathematical formulation of LOESS/LOWESS; (3) Weight functions and their role in locality; (4) Local polynomial regression of various degrees; (5) The complete algorithm with implementation details; (6) Robustness extensions for outlier resistance.
From Global to Local:
Consider regression as answering the question: Given a new point $x_0$, what should we predict for $y$?
Global approach: Fit a model $f(x; \boldsymbol{\beta})$ using all training data, then evaluate $\hat{y} = f(x_0; \hat{\boldsymbol{\beta}})$.
Local approach: Fit a model using only data points near $x_0$, giving more weight to closer points.
The key insight is that locally, even complex global relationships often appear simple. A sinusoidal curve looks linear when you zoom in enough. A complicated economic trend might be well-approximated by a line within any small time window.
The Classical Motivation:
Local regression emerged from exploratory data analysis in the 1970s-1980s. Researchers like Cleveland (1979) and Cleveland & Devlin (1988) developed LOESS (LOcally Estimated Scatterplot Smoothing) as a way to visualize trends in noisy data without committing to a parametric form.
You'll see both terms in the literature. LOWESS (LOcally WEighted Scatterplot Smoothing) was Cleveland's original 1979 method using local linear fits. LOESS (1988) generalized this to local polynomial fits of any degree. Today, the terms are often used interchangeably, with LOESS being more common.
Intuition Through Visualization:
Imagine you want to estimate the trend at point $x_0$. The local regression approach: (1) find the training points nearest to $x_0$; (2) assign each a weight that decreases with its distance from $x_0$; (3) fit a simple model (say, a line) by weighted least squares; (4) report that model's value at $x_0$ as the prediction. Then repeat the whole process for the next evaluation point.
This sounds computationally expensive (fit a model for every prediction point!)—and it is! But the resulting flexibility is remarkable, and modern computing makes it practical for many applications.
The Weighted Least Squares Framework:
Let $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$ be our training data. To predict at a target point $x_0$, we solve a weighted least squares problem:
$$\min_{\beta_0, \beta_1, \ldots, \beta_p} \sum_{i=1}^{n} w_i(x_0) \left[ y_i - \beta_0 - \beta_1 (x_i - x_0) - \ldots - \beta_p (x_i - x_0)^p \right]^2$$
where $w_i(x_0)$ is the weight assigned to observation $i$ based on its distance from $x_0$.
Key components: the polynomial is expanded in powers of $(x_i - x_0)$, i.e., centered at the target point; the degree $p$ controls local flexibility; and the weights $w_i(x_0)$ restrict the fit to a neighborhood of $x_0$. Because of the centering, the prediction at $x_0$ is simply $\hat{y}(x_0) = \hat{\beta}_0$.
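As a minimal sketch of this idea (the helper name `local_fit_at` is ours, not from any library), the weighted least squares problem at a single target point can be solved directly with NumPy; on noise-free linear data the intercept recovers the true value at $x_0$:

```python
import numpy as np

def local_fit_at(x0, x, y, w, degree=1):
    """Solve the weighted least squares problem centered at x0.

    Returns the local coefficients; the prediction at x0 is the
    intercept beta[0], since the polynomial is centered at x0.
    """
    # Design matrix in centered coordinates: columns (x_i - x0)^p
    X = np.column_stack([(x - x0)**p for p in range(degree + 1)])
    # Weighted normal equations: (X'WX) beta = X'Wy
    WX = w[:, None] * X
    beta = np.linalg.solve(X.T @ WX, X.T @ (w * y))
    return beta

# Toy data: exactly linear y = 2x, equal weights
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x
w = np.ones_like(x)

beta = local_fit_at(1.5, x, y, w, degree=1)
print(beta[0])  # prediction at x0 = 1.5 -> 3.0
```

With real data the weights come from a kernel such as the tri-cube function discussed below, so distant points contribute little or nothing.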
Matrix Formulation:
For local polynomial regression of degree $p$, define:
$$\mathbf{X}_0 = \begin{bmatrix} 1 & (x_1 - x_0) & (x_1 - x_0)^2 & \cdots & (x_1 - x_0)^p \\ 1 & (x_2 - x_0) & (x_2 - x_0)^2 & \cdots & (x_2 - x_0)^p \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & (x_n - x_0) & (x_n - x_0)^2 & \cdots & (x_n - x_0)^p \end{bmatrix}$$
$$\mathbf{W}_0 = \text{diag}(w_1(x_0), w_2(x_0), \ldots, w_n(x_0))$$
The weighted least squares solution is:
$$\hat{\boldsymbol{\beta}}_0 = (\mathbf{X}_0^T \mathbf{W}_0 \mathbf{X}_0)^{-1} \mathbf{X}_0^T \mathbf{W}_0 \mathbf{y}$$
And the prediction at $x_0$ is:
$$\hat{y}(x_0) = \mathbf{e}_1^T \hat{\boldsymbol{\beta}}_0 = \mathbf{e}_1^T (\mathbf{X}_0^T \mathbf{W}_0 \mathbf{X}_0)^{-1} \mathbf{X}_0^T \mathbf{W}_0 \mathbf{y}$$
where $\mathbf{e}_1 = [1, 0, 0, \ldots, 0]^T$ extracts the intercept.
Notice that $\hat{y}(x_0)$ is a linear combination of the $y_i$ values: $$\hat{y}(x_0) = \sum_{i=1}^{n} s_i(x_0) y_i$$ where $s_i(x_0) = [\mathbf{e}_1^T (\mathbf{X}_0^T \mathbf{W}_0 \mathbf{X}_0)^{-1} \mathbf{X}_0^T \mathbf{W}_0]_i$. This makes LOESS a linear smoother, sharing properties with kernel smoothers and splines. The $s_i(x_0)$ are called the equivalent kernel weights.
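The equivalent kernel weights can be computed explicitly. A small sketch (the helper name `smoother_weights` is ours): for a local linear fit the weights sum to one at any $x_0$, and they also reproduce linear functions exactly, which is what makes the estimator a well-behaved weighted average of the responses.

```python
import numpy as np

def smoother_weights(x0, x, w, degree=1):
    """Compute s_i(x0): the linear-smoother weights such that
    y_hat(x0) = sum_i s_i(x0) * y_i."""
    X = np.column_stack([(x - x0)**p for p in range(degree + 1)])
    W = np.diag(w)
    # Rows of (X'WX)^{-1} X'W; the first row extracts the intercept
    A = np.linalg.solve(X.T @ W @ X, X.T @ W)
    return A[0]

x = np.linspace(0, 1, 7)
w = np.exp(-((x - 0.4) / 0.3)**2)   # arbitrary positive locality weights
s = smoother_weights(0.4, x, w, degree=1)
print(np.sum(s))       # ≈ 1.0: the fit reproduces constants exactly
print(np.sum(s * x))   # ≈ 0.4: the fit reproduces linear functions at x0
```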
The Weight Function:
The weights $w_i(x_0)$ determine how 'local' the fit is. They should satisfy: the weight is largest at zero distance, decreases monotonically as distance from $x_0$ grows, and (ideally) drops smoothly to zero outside the neighborhood, so that predictions vary smoothly as $x_0$ moves.
The Tri-cube Weight Function (Cleveland's standard choice):
$$W(u) = \begin{cases} (1 - |u|^3)^3 & \text{if } |u| < 1 \\ 0 & \text{otherwise} \end{cases}$$
This function has appealing properties: it is smooth (continuous first and second derivatives, including at the boundary $|u| = 1$), it has compact support so distant points are ignored entirely, and it is nearly flat around $u = 0$, giving the closest points roughly equal weight.
```python
import numpy as np
import matplotlib.pyplot as plt

def tricube_weight(u: np.ndarray) -> np.ndarray:
    """
    Tri-cube weight function: W(u) = (1 - |u|^3)^3 for |u| < 1, else 0.
    Cleveland's standard choice for LOESS.
    """
    u = np.asarray(u)
    w = np.zeros_like(u, dtype=float)
    mask = np.abs(u) < 1
    w[mask] = (1 - np.abs(u[mask])**3)**3
    return w

def epanechnikov_weight(u: np.ndarray) -> np.ndarray:
    """
    Epanechnikov (parabolic) weight: W(u) = 3/4 * (1 - u^2) for |u| < 1.
    Optimal in certain asymptotic senses.
    """
    u = np.asarray(u)
    w = np.zeros_like(u, dtype=float)
    mask = np.abs(u) < 1
    w[mask] = 0.75 * (1 - u[mask]**2)
    return w

def gaussian_weight(u: np.ndarray, sigma: float = 1/3) -> np.ndarray:
    """
    Gaussian weight: W(u) = exp(-u^2 / (2*sigma^2)).
    Infinite support but decays rapidly.
    """
    return np.exp(-u**2 / (2 * sigma**2))

def uniform_weight(u: np.ndarray) -> np.ndarray:
    """
    Uniform (box) weight: W(u) = 1 for |u| < 1, else 0.
    Simplest but causes discontinuous predictions.
    """
    return (np.abs(u) < 1).astype(float)

# Visualize the weight functions
u = np.linspace(-1.5, 1.5, 500)

fig, ax = plt.subplots(figsize=(10, 6))
ax.plot(u, tricube_weight(u), 'b-', lw=2.5, label='Tri-cube (LOESS standard)')
ax.plot(u, epanechnikov_weight(u), 'g--', lw=2, label='Epanechnikov')
ax.plot(u, gaussian_weight(u), 'r-.', lw=2, label='Gaussian (σ=1/3)')
ax.plot(u, uniform_weight(u), 'm:', lw=2, label='Uniform (box)')

ax.axhline(0, color='gray', lw=0.5)
ax.axvline(0, color='gray', lw=0.5)
ax.axvline(-1, color='gray', lw=0.5, ls='--', alpha=0.5)
ax.axvline(1, color='gray', lw=0.5, ls='--', alpha=0.5)

ax.set_xlabel('Normalized distance u = (x - x₀) / h', fontsize=12)
ax.set_ylabel('Weight W(u)', fontsize=12)
ax.set_title('Weight Functions for Local Regression', fontsize=14)
ax.legend(loc='upper right', fontsize=10)
ax.grid(True, alpha=0.3)
ax.set_xlim(-1.5, 1.5)
ax.set_ylim(-0.1, 1.1)

plt.tight_layout()
plt.show()
```

Bandwidth and the Span Parameter:
The key hyperparameter in LOESS is the bandwidth $h$ (also called the span or $\alpha$), which controls the size of the local neighborhood.
Two ways to specify bandwidth:
Fixed bandwidth $h$: The neighborhood has a fixed width. Points within distance $h$ of $x_0$ receive non-zero weight: $$w_i(x_0) = W\left( \frac{x_i - x_0}{h} \right)$$
Nearest-neighbor fraction $\alpha$ (LOESS standard): The neighborhood includes a fraction $\alpha$ of the data. The bandwidth $h(x_0)$ is the distance to the $k$-th nearest neighbor, where $k = \lfloor \alpha n \rfloor$: $$h(x_0) = |x_{(k)} - x_0|$$ where $x_{(k)}$ is the $k$-th nearest neighbor of $x_0$.
The nearest-neighbor approach is adaptive: neighborhoods are smaller in dense regions and larger in sparse regions, promoting consistent local sample sizes.
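A short sketch of the adaptive bandwidth (the helper name `nn_bandwidth` is ours): in dense regions the $k$-th nearest neighbor is close, so $h(x_0)$ shrinks; in sparse regions it stretches.

```python
import numpy as np

def nn_bandwidth(x0, x, span):
    """Distance from x0 to its k-th nearest neighbor, k = ceil(span * n)."""
    k = int(np.ceil(span * len(x)))
    return np.sort(np.abs(x - x0))[k - 1]

# Dense cluster near 0, sparse points beyond
x = np.array([0.0, 0.1, 0.2, 0.3, 2.0, 4.0, 6.0, 8.0, 9.0, 10.0])

# 30% span on n = 10 points -> k = 3 nearest neighbors
print(nn_bandwidth(0.15, x, span=0.3))  # dense region: small h (≈ 0.15)
print(nn_bandwidth(7.0, x, span=0.3))   # sparse region: larger h (= 2.0)
```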
| Span (α) | Neighborhood | Bias | Variance | Use Case |
|---|---|---|---|---|
| 0.1 - 0.2 | Very local | Low | High | Complex, wiggly patterns; large datasets |
| 0.3 - 0.5 | Moderate | Moderate | Moderate | General purpose; balanced tradeoff |
| 0.5 - 0.7 | Broad | Higher | Lower | Smoother trends; noisy data |
| 0.8 - 1.0 | Very broad | High | Very low | Captures only major trends; small datasets |
Choosing the Local Polynomial Degree:
The degree $p$ of the local polynomial affects both fitting and boundary behavior.
Local constant ($p = 0$): Nadaraya-Watson Estimator
The simplest case fits a horizontal line (constant) at each point: $$\hat{y}(x_0) = \frac{\sum_{i=1}^{n} w_i(x_0) y_i}{\sum_{i=1}^{n} w_i(x_0)}$$
This is simply a weighted average of nearby $y$ values. Fast to compute but suffers from boundary bias—the estimate is pulled toward the center of the local neighborhood.
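The Nadaraya-Watson estimator is short enough to write in a few lines. A sketch (assuming a tri-cube kernel with fixed bandwidth for simplicity); the second call illustrates the boundary bias, since the one-sided average of an increasing function overshoots the true value at the edge:

```python
import numpy as np

def nadaraya_watson(x0, x, y, h):
    """Local constant fit: kernel-weighted average of nearby y values."""
    u = (x - x0) / h
    w = np.where(np.abs(u) < 1, (1 - np.abs(u)**3)**3, 0.0)  # tri-cube
    return np.sum(w * y) / np.sum(w)

x = np.linspace(0, 1, 11)
y = 3.0 * np.ones_like(x)                   # constant signal
print(nadaraya_watson(0.5, x, y, h=0.25))   # ≈ 3.0: averages reproduce constants

y2 = x.copy()                               # increasing signal, f(0) = 0
print(nadaraya_watson(0.0, x, y2, h=0.25))  # > 0: boundary bias pulls the estimate inward
```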
Local linear ($p = 1$): The Standard Choice
Fits a line at each point. The key advantage: automatic boundary correction. At boundaries, the local line automatically adjusts its intercept to minimize bias.
Local quadratic ($p = 2$):
Fits a parabola at each point. Better captures curvature in the true function, but requires more data and is more variable.
Theory suggests that odd polynomial degrees ($p = 1, 3, 5, \ldots$) provide better bias-variance tradeoffs than even degrees at the same bandwidth. This is because odd-degree polynomials can correct for asymmetric data distributions more effectively. For this reason, $p = 1$ is almost universally preferred over $p = 0$ or $p = 2$.
Boundary Bias Illustrated:
Consider estimating $f(x)$ at $x_0$ near the left boundary. With local constant fitting: the neighborhood is one-sided (there are no data to the left of $x_0$), so the weighted average is dominated by points to the right. If $f$ is increasing there, the estimate overshoots $f(x_0)$, and the bias degrades from $O(h^2)$ to $O(h)$.

With local linear fitting: the slope term absorbs the asymmetry of the neighborhood. The fitted line tilts to match the local trend, so the intercept remains an accurate estimate of $f(x_0)$ and the interior $O(h^2)$ bias rate is preserved at the boundary.
This boundary correction is one of the most important reasons to use local linear regression over kernel smoothing with local constant fits.
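A quick numerical check of this claim (a self-contained sketch; the helper `local_poly_at` is ours): on noise-free linear data, the local linear fit recovers the boundary value exactly, while the local constant fit is pulled toward the interior.

```python
import numpy as np

def local_poly_at(x0, x, y, h, degree):
    """Local polynomial fit at x0 with tri-cube weights; returns y_hat(x0)."""
    u = np.abs(x - x0) / h
    w = np.where(u < 1, (1 - u**3)**3, 0.0)
    X = np.column_stack([(x - x0)**p for p in range(degree + 1)])
    beta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))
    return beta[0]

x = np.linspace(0, 1, 21)
y = 2.0 * x                       # noise-free linear truth, f(0) = 0

est_const = local_poly_at(0.0, x, y, h=0.3, degree=0)  # biased: pulled above 0
est_lin = local_poly_at(0.0, x, y, h=0.3, degree=1)    # exact at the boundary
print(est_const, est_lin)
```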
Algorithm: LOESS with Local Linear Regression
Given training data $(x_1, y_1), \ldots, (x_n, y_n)$, span parameter $\alpha$, and evaluation points $x_0^{(1)}, \ldots, x_0^{(m)}$:
For each evaluation point $x_0$: (1) compute the distances $|x_i - x_0|$ to all training points; (2) set the bandwidth $h(x_0)$ to the $k$-th smallest distance, where $k = \lceil \alpha n \rceil$; (3) compute tri-cube weights $w_i = W((x_i - x_0)/h)$; (4) solve the weighted least squares problem for a line centered at $x_0$; (5) take the intercept $\hat{\beta}_0$ as the prediction $\hat{y}(x_0)$.
```python
import numpy as np
from typing import Tuple, Optional

def tricube(u: np.ndarray) -> np.ndarray:
    """Tri-cube weight function."""
    u = np.asarray(u)
    w = np.zeros_like(u, dtype=float)
    mask = np.abs(u) < 1
    w[mask] = (1 - np.abs(u[mask])**3)**3
    return w

def loess_fit(x: np.ndarray,
              y: np.ndarray,
              x_eval: Optional[np.ndarray] = None,
              span: float = 0.75,
              degree: int = 1) -> Tuple[np.ndarray, np.ndarray]:
    """
    LOESS (Locally Estimated Scatterplot Smoothing).

    Parameters:
        x: Input values (n,)
        y: Target values (n,)
        x_eval: Points at which to evaluate (m,). If None, use x.
        span: Fraction of data to use in each local fit (0 < span <= 1)
        degree: Local polynomial degree (0, 1, or 2)

    Returns:
        x_eval: Evaluation points
        y_hat: Fitted values at evaluation points
    """
    x = np.asarray(x).flatten()
    y = np.asarray(y).flatten()
    n = len(x)

    if x_eval is None:
        x_eval = x.copy()
    else:
        x_eval = np.asarray(x_eval).flatten()

    m = len(x_eval)
    y_hat = np.zeros(m)

    # Number of nearest neighbors
    k = int(np.ceil(span * n))
    k = max(k, degree + 1)  # Need at least degree+1 points for a fit
    k = min(k, n)           # Can't use more points than we have

    for j, x0 in enumerate(x_eval):
        # Step 1: Find distances to all points
        distances = np.abs(x - x0)

        # Step 2: Find the k-th smallest distance (bandwidth)
        sorted_distances = np.sort(distances)
        h = sorted_distances[k - 1]

        # Prevent zero bandwidth
        if h < 1e-10:
            h = 1e-10

        # Step 3: Compute weights
        u = distances / h
        w = tricube(u)

        # Step 4: Local polynomial fit
        # Build design matrix centered at x0
        X_local = np.column_stack([(x - x0)**p for p in range(degree + 1)])

        # Weighted least squares: (X'WX)^{-1} X'Wy
        W = np.diag(w)
        XtW = X_local.T @ W
        XtWX = XtW @ X_local
        XtWy = XtW @ y

        # Solve the normal equations
        try:
            beta = np.linalg.solve(XtWX, XtWy)
        except np.linalg.LinAlgError:
            # Singular matrix - fall back to least squares on the
            # sqrt-weighted system (equivalent to weighted least squares)
            sw = np.sqrt(w)
            beta = np.linalg.lstsq(X_local * sw[:, np.newaxis],
                                   y * sw, rcond=None)[0]

        # Step 5: Prediction is the intercept (since we centered at x0)
        y_hat[j] = beta[0]

    return x_eval, y_hat

# =============================================================================
# DEMONSTRATION
# =============================================================================
np.random.seed(42)

# Generate data with a complex pattern
n = 100
x = np.sort(np.random.uniform(0, 2*np.pi, n))
y_true = np.sin(x) + 0.5 * np.sin(3*x)    # True function
y = y_true + np.random.normal(0, 0.3, n)  # Add noise

# Fit LOESS with different spans
spans = [0.2, 0.4, 0.7]
x_fine = np.linspace(0, 2*np.pi, 200)

print("LOESS Demonstration")
print("=" * 50)
print(f"{'Span':>8} | {'Mean Squared Error':>20}")
print("-" * 50)

for span in spans:
    _, y_hat = loess_fit(x, y, x_eval=x_fine, span=span, degree=1)
    _, y_hat_train = loess_fit(x, y, span=span, degree=1)
    mse = np.mean((y - y_hat_train)**2)
    print(f"{span:>8.2f} | {mse:>20.6f}")
```

LOESS has O(n²) complexity for fitting all training points, since each of the $n$ predictions requires examining all $n$ data points. For large datasets, consider: (1) evaluating at fewer points than training points; (2) using approximate nearest neighbors; (3) switching to kernel smoothing with fixed bandwidth; or (4) using binning/interpolation strategies.
The Outlier Problem:
Standard LOESS uses least squares, which is notoriously sensitive to outliers. A single aberrant point can substantially distort the local fit, affecting predictions in its neighborhood.
Cleveland's Robustifying Procedure:
Cleveland (1979) proposed an iterative reweighting scheme that down-weights observations with large residuals: (1) fit LOESS as usual and compute residuals $r_i = y_i - \hat{y}(x_i)$; (2) let $s = \text{median}(|r_i|)$; (3) compute robustness weights $\delta_i = B(r_i / (6s))$ using the bisquare function; (4) refit LOESS, multiplying each locality weight by $\delta_i$; (5) repeat a few times (typically 2-3 iterations).
The Bisquare Weight Function:
$$B(u) = \begin{cases} (1 - u^2)^2 & \text{if } |u| < 1 \\ 0 & \text{otherwise} \end{cases}$$
This function: is smooth on its support, gives full weight to observations with small scaled residuals, decreases as $|u|$ grows, and assigns exactly zero weight when $|u| \geq 1$, i.e., when a residual exceeds six times the median absolute residual.
The factor of 6 sets the cutoff at six times the median absolute residual. Under Gaussian errors the median absolute residual is about $0.67\sigma$, so the cutoff is roughly $4\sigma$: essentially all well-behaved observations keep non-zero robustness weights, while gross outliers are discarded entirely.
```python
import numpy as np
from typing import Tuple

def bisquare(u: np.ndarray) -> np.ndarray:
    """Bisquare weight function for robustifying."""
    u = np.asarray(u)
    w = np.zeros_like(u, dtype=float)
    mask = np.abs(u) < 1
    w[mask] = (1 - u[mask]**2)**2
    return w

def robust_loess(x: np.ndarray,
                 y: np.ndarray,
                 span: float = 0.75,
                 degree: int = 1,
                 robustifying_iterations: int = 3) -> Tuple[np.ndarray, np.ndarray]:
    """
    Robust LOESS with iterative reweighting for outlier resistance.

    Parameters:
        x: Input values
        y: Target values
        span: Fraction of data used in local fits
        degree: Local polynomial degree
        robustifying_iterations: Number of robustifying iterations (0 = none)

    Returns:
        x: Input values (sorted)
        y_hat: Fitted values
    """
    x = np.asarray(x).flatten()
    y = np.asarray(y).flatten()
    n = len(x)

    # Sort by x for consistency
    order = np.argsort(x)
    x = x[order]
    y = y[order]

    k = int(np.ceil(span * n))
    k = max(k, degree + 1)
    k = min(k, n)

    # Initialize robustness weights to 1
    delta = np.ones(n)

    for iteration in range(robustifying_iterations + 1):
        y_hat = np.zeros(n)

        for j in range(n):
            x0 = x[j]

            # Distances and bandwidth
            distances = np.abs(x - x0)
            sorted_distances = np.sort(distances)
            h = max(sorted_distances[k - 1], 1e-10)

            # Combined weights: locality (tri-cube) * robustness
            u = (x - x0) / h
            w = (1 - np.abs(u)**3)**3 * (np.abs(u) < 1)
            w = w * delta  # Apply robustness weights

            # Local polynomial fit
            X_local = np.column_stack([(x - x0)**p for p in range(degree + 1)])
            try:
                W = np.diag(w)
                XtWX = X_local.T @ W @ X_local
                XtWy = X_local.T @ W @ y
                beta = np.linalg.solve(XtWX, XtWy)
            except np.linalg.LinAlgError:
                # Fallback for singular cases: weighted mean
                beta = np.zeros(degree + 1)
                beta[0] = np.average(y, weights=w + 1e-10)

            y_hat[j] = beta[0]

        # Compute robustness weights for next iteration
        if iteration < robustifying_iterations:
            residuals = y - y_hat
            s = np.median(np.abs(residuals))  # Median absolute residual
            if s > 1e-10:
                u = residuals / (6.0 * s)
                delta = bisquare(u)
            else:
                break  # Perfect fit, no need to continue

    return x, y_hat

# Demonstration with outliers
np.random.seed(42)
n = 80
x = np.sort(np.random.uniform(0, 4, n))
y_true = 2 * np.sin(x)
y = y_true + np.random.normal(0, 0.3, n)

# Add some outliers
outlier_indices = [10, 25, 50, 65]
y[outlier_indices] += np.array([3, -4, 5, -3.5])

# Compare standard and robust LOESS
_, y_standard = robust_loess(x, y, span=0.3, robustifying_iterations=0)
_, y_robust = robust_loess(x, y, span=0.3, robustifying_iterations=3)

mse_standard = np.mean((y_standard - y_true)**2)
mse_robust = np.mean((y_robust - y_true)**2)

print("Robust LOESS vs Standard LOESS (with 4 outliers)")
print("=" * 50)
print(f"Standard LOESS MSE: {mse_standard:.4f}")
print(f"Robust LOESS MSE:   {mse_robust:.4f}")
print(f"Improvement: {100*(mse_standard - mse_robust)/mse_standard:.1f}%")
```

Use robustifying iterations when: (1) Data may contain outliers or recording errors; (2) The noise distribution is heavy-tailed; (3) You want a 'safe' default that works well across scenarios. The computational overhead is typically 2-3x, but the protection against outliers is often worth it.
Bias-Variance Tradeoff in Local Regression:
For local linear regression with bandwidth $h$, the asymptotic bias and variance at a point $x_0$ are:
Bias: $$\text{Bias}[\hat{f}(x_0)] \approx \frac{h^2}{2} f''(x_0) \mu_2$$
where $\mu_2 = \int u^2 K(u) du$ for kernel $K$.
Variance: $$\text{Var}[\hat{f}(x_0)] \approx \frac{\sigma^2 \nu_0}{n h \, p(x_0)}$$
where $p(x_0)$ is the density of $x$ at $x_0$, and $\nu_0 = \int K(u)^2 \, du$.
Key insights: bias grows like $h^2$ and scales with the curvature $f''(x_0)$, so oversmoothing hurts most where the function bends sharply; variance shrinks like $1/(nh)$ and inflates where data are sparse (small $p(x_0)$); and the optimal bandwidth must balance the two, shrinking as $n$ grows.
Optimal Bandwidth:
Minimizing the asymptotic mean squared error (MSE) yields the optimal bandwidth:
$$h_{\text{opt}}(x_0) \propto \left( \frac{\sigma^2}{n p(x_0) [f''(x_0)]^2} \right)^{1/5}$$
Convergence rate: $$\text{MSE}[\hat{f}(x_0)] = O(n^{-4/5})$$
This is slower than the parametric rate of $O(n^{-1})$, but faster than the $O(n^{-2/3})$ rate of the local constant estimator.
Practical implication: The $n^{-4/5}$ rate means you need more data to achieve the same accuracy as a correctly specified parametric model—the price of flexibility.
| Method | Convergence Rate | Required n for ε=0.01 error |
|---|---|---|
| Parametric (correctly specified) | O(n⁻¹) | ~100 |
| Local linear regression | O(n⁻⁴ᐟ⁵) | ~3,200 |
| Local constant (Nadaraya-Watson) | O(n⁻²ᐟ³) | ~10,000 |
| Nonparametric in d dimensions | O(n⁻⁴ᐟ⁽⁴⁺ᵈ⁾) | Explodes with d |
LOESS has effective degrees of freedom defined as $\text{tr}(\mathbf{S})$ where $\mathbf{S}$ is the smoothing matrix ($\hat{\mathbf{y}} = \mathbf{S} \mathbf{y}$). This quantifies the 'complexity' of the fit and is useful for model comparison. Smaller bandwidth → more degrees of freedom → more flexible fit.
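The effective degrees of freedom can be computed by assembling the smoother matrix row by row from the equivalent kernel weights (a sketch assuming the tri-cube/nearest-neighbor conventions above; `loess_smoother_matrix` is our own helper):

```python
import numpy as np

def loess_smoother_matrix(x, span=0.5, degree=1):
    """Build S such that y_hat = S @ y for a LOESS fit at the training points."""
    n = len(x)
    k = max(int(np.ceil(span * n)), degree + 1)
    S = np.zeros((n, n))
    for j, x0 in enumerate(x):
        d = np.abs(x - x0)
        h = max(np.sort(d)[k - 1], 1e-10)
        u = d / h
        w = np.where(u < 1, (1 - u**3)**3, 0.0)  # tri-cube weights
        X = np.column_stack([(x - x0)**p for p in range(degree + 1)])
        # Row j of S holds the equivalent kernel weights s_i(x0)
        A = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ np.diag(w))
        S[j] = A[0]
    return S

x = np.linspace(0, 1, 40)
for span in (0.2, 0.5, 0.9):
    S = loess_smoother_matrix(x, span=span)
    print(f"span={span}: effective df = {np.trace(S):.2f}")
```

Running this shows the trace falling as the span widens, matching the intuition that broader neighborhoods give a stiffer, lower-complexity fit.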
We've established the complete framework for local regression, from intuition through theory to a working implementation: weighted local fits, the tri-cube weight function, span selection, polynomial degree, robustness iterations, and the bias-variance analysis.
What's Next:
LOESS is one approach to local fitting. The next page explores kernel smoothing—a closely related technique that uses fixed bandwidth and has a cleaner theoretical foundation. We'll see how kernel smoothers relate to LOESS and when each is preferred.
You now understand local regression from theory to implementation. LOESS provides a powerful, flexible tool for exploring nonlinear relationships without parametric assumptions. Next, we'll explore kernel smoothing and see how it complements and extends these ideas.