Throughout our journey in regression, we've fit global models—functions defined by a single set of parameters that apply uniformly across the entire input domain. Linear regression uses one slope and intercept everywhere. Even polynomial regression, despite its flexibility, commits to a single polynomial that governs all predictions.
But what if the relationship between variables changes fundamentally across different regions of the input space? What if the slope is steep for small values of $x$ but nearly flat for large values? What if there are local patterns that no single polynomial can capture?
Local regression addresses these challenges by fitting separate models in different neighborhoods of the input space. Rather than asking 'What single function best describes all the data?', it asks 'What function best describes the data near this specific point?'
This deceptively simple shift in perspective unlocks remarkable flexibility—and introduces fascinating new challenges in bias-variance tradeoffs, computational complexity, and the curse of dimensionality.
By the end of this page, you will understand: (1) The fundamental philosophy of local regression; (2) The mathematical formulation of LOESS/LOWESS; (3) Weight functions and their role in locality; (4) Local polynomial regression of various degrees; (5) The complete algorithm with implementation details; (6) Robustness extensions for outlier resistance.
From Global to Local:
Consider regression as answering the question: Given a new point $x_0$, what should we predict for $y$?
Global approach: Fit a model $f(x; \boldsymbol{\beta})$ using all training data, then evaluate $\hat{y} = f(x_0; \hat{\boldsymbol{\beta}})$.
Local approach: Fit a model using only data points near $x_0$, giving more weight to closer points.
The key insight is that locally, even complex global relationships often appear simple. A sinusoidal curve looks linear when you zoom in enough. A complicated economic trend might be well-approximated by a line within any small time window.
The Classical Motivation:
Local regression emerged from exploratory data analysis in the 1970s-1980s. Researchers like Cleveland (1979) and Cleveland & Devlin (1988) developed LOESS (LOcally Estimated Scatterplot Smoothing) as a way to visualize trends in noisy data without committing to a parametric form.
You'll see both terms in the literature. LOWESS (LOcally WEighted Scatterplot Smoothing) was Cleveland's original 1979 method using local linear fits. LOESS (1988) generalized this to local polynomial fits of any degree. Today, the terms are often used interchangeably, with LOESS being more common.
Intuition Through Visualization:
Imagine you want to estimate the trend at point $x_0$. The local regression approach: (1) find the training points nearest to $x_0$; (2) assign each a weight that decreases with its distance from $x_0$; (3) fit a simple model (say, a line) by weighted least squares; (4) report that model's value at $x_0$ as the prediction. Then repeat the whole process for the next evaluation point.
This sounds computationally expensive (fit a model for every prediction point!)—and it is! But the resulting flexibility is remarkable, and modern computing makes it practical for many applications.
The Weighted Least Squares Framework:
Let $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$ be our training data. To predict at a target point $x_0$, we solve a weighted least squares problem:
$$\min_{\beta_0, \beta_1, \ldots, \beta_p} \sum_{i=1}^{n} w_i(x_0) \left[ y_i - \beta_0 - \beta_1 (x_i - x_0) - \ldots - \beta_p (x_i - x_0)^p \right]^2$$
where $w_i(x_0)$ is the weight assigned to observation $i$ based on its distance from $x_0$.
Key components: the polynomial is expanded in powers of $(x_i - x_0)$, i.e., centered at the target point; the degree $p$ controls local flexibility; and the weights $w_i(x_0)$ restrict the fit to a neighborhood of $x_0$. Because of the centering, the prediction at $x_0$ is simply $\hat{y}(x_0) = \hat{\beta}_0$.
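As a minimal sketch of this idea (the helper name `local_fit_at` is ours, not from any library), the weighted least squares problem at a single target point can be solved directly with NumPy; on noise-free linear data the intercept recovers the true value at $x_0$:

```python
import numpy as np

def local_fit_at(x0, x, y, w, degree=1):
    """Solve the weighted least squares problem centered at x0.

    Returns the local coefficients; the prediction at x0 is the
    intercept beta[0], since the polynomial is centered at x0.
    """
    # Design matrix in centered coordinates: columns (x_i - x0)^p
    X = np.column_stack([(x - x0)**p for p in range(degree + 1)])
    # Weighted normal equations: (X'WX) beta = X'Wy
    WX = w[:, None] * X
    beta = np.linalg.solve(X.T @ WX, X.T @ (w * y))
    return beta

# Toy data: exactly linear y = 2x, equal weights
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x
w = np.ones_like(x)

beta = local_fit_at(1.5, x, y, w, degree=1)
print(beta[0])  # prediction at x0 = 1.5 -> 3.0
```

With real data the weights come from a kernel such as the tri-cube function discussed below, so distant points contribute little or nothing.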
Matrix Formulation:
For local polynomial regression of degree $p$, define:
$$\mathbf{X}_0 = \begin{bmatrix} 1 & (x_1 - x_0) & (x_1 - x_0)^2 & \cdots & (x_1 - x_0)^p \\ 1 & (x_2 - x_0) & (x_2 - x_0)^2 & \cdots & (x_2 - x_0)^p \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & (x_n - x_0) & (x_n - x_0)^2 & \cdots & (x_n - x_0)^p \end{bmatrix}$$
$$\mathbf{W}_0 = \text{diag}(w_1(x_0), w_2(x_0), \ldots, w_n(x_0))$$
The weighted least squares solution is:
$$\hat{\boldsymbol{\beta}}_0 = (\mathbf{X}_0^T \mathbf{W}_0 \mathbf{X}_0)^{-1} \mathbf{X}_0^T \mathbf{W}_0 \mathbf{y}$$
And the prediction at $x_0$ is:
$$\hat{y}(x_0) = \mathbf{e}_1^T \hat{\boldsymbol{\beta}}_0 = \mathbf{e}_1^T (\mathbf{X}_0^T \mathbf{W}_0 \mathbf{X}_0)^{-1} \mathbf{X}_0^T \mathbf{W}_0 \mathbf{y}$$
where $\mathbf{e}_1 = [1, 0, 0, \ldots, 0]^T$ extracts the intercept.
Notice that $\hat{y}(x_0)$ is a linear combination of the $y_i$ values: $$\hat{y}(x_0) = \sum_{i=1}^{n} s_i(x_0) y_i$$ where $s_i(x_0) = [\mathbf{e}_1^T (\mathbf{X}_0^T \mathbf{W}_0 \mathbf{X}_0)^{-1} \mathbf{X}_0^T \mathbf{W}_0]_i$. This makes LOESS a linear smoother, sharing properties with kernel smoothers and splines. The $s_i(x_0)$ are called the equivalent kernel weights.
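The equivalent kernel weights can be computed explicitly. A small sketch (the helper name `smoother_weights` is ours): for a local linear fit the weights sum to one at any $x_0$, and they also reproduce linear functions exactly, which is what makes the estimator a well-behaved weighted average of the responses.

```python
import numpy as np

def smoother_weights(x0, x, w, degree=1):
    """Compute s_i(x0): the linear-smoother weights such that
    y_hat(x0) = sum_i s_i(x0) * y_i."""
    X = np.column_stack([(x - x0)**p for p in range(degree + 1)])
    W = np.diag(w)
    # Rows of (X'WX)^{-1} X'W; the first row extracts the intercept
    A = np.linalg.solve(X.T @ W @ X, X.T @ W)
    return A[0]

x = np.linspace(0, 1, 7)
w = np.exp(-((x - 0.4) / 0.3)**2)   # arbitrary positive locality weights
s = smoother_weights(0.4, x, w, degree=1)
print(np.sum(s))       # ≈ 1.0: the fit reproduces constants exactly
print(np.sum(s * x))   # ≈ 0.4: the fit reproduces linear functions at x0
```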
The Weight Function:
The weights $w_i(x_0)$ determine how 'local' the fit is. They should satisfy: the weight is largest at zero distance, decreases monotonically as distance from $x_0$ grows, and (ideally) drops smoothly to zero outside the neighborhood, so that predictions vary smoothly as $x_0$ moves.
The Tri-cube Weight Function (Cleveland's standard choice):
$$W(u) = \begin{cases} (1 - |u|^3)^3 & \text{if } |u| < 1 \\ 0 & \text{otherwise} \end{cases}$$
This function has appealing properties: it is smooth (continuous first and second derivatives, including at the boundary $|u| = 1$), it has compact support so distant points are ignored entirely, and it is nearly flat around $u = 0$, giving the closest points roughly equal weight.
```python
import numpy as np
import matplotlib.pyplot as plt

def tricube_weight(u: np.ndarray) -> np.ndarray:
    """
    Tri-cube weight function: W(u) = (1 - |u|^3)^3 for |u| < 1, else 0.
    Cleveland's standard choice for LOESS.
    """
    u = np.asarray(u)
    w = np.zeros_like(u, dtype=float)
    mask = np.abs(u) < 1
    w[mask] = (1 - np.abs(u[mask])**3)**3
    return w

def epanechnikov_weight(u: np.ndarray) -> np.ndarray:
    """
    Epanechnikov (parabolic) weight: W(u) = 3/4 * (1 - u^2) for |u| < 1.
    Optimal in certain asymptotic senses.
    """
    u = np.asarray(u)
    w = np.zeros_like(u, dtype=float)
    mask = np.abs(u) < 1
    w[mask] = 0.75 * (1 - u[mask]**2)
    return w

def gaussian_weight(u: np.ndarray, sigma: float = 1/3) -> np.ndarray:
    """
    Gaussian weight: W(u) = exp(-u^2 / (2*sigma^2)).
    Infinite support but decays rapidly.
    """
    return np.exp(-u**2 / (2 * sigma**2))

def uniform_weight(u: np.ndarray) -> np.ndarray:
    """
    Uniform (box) weight: W(u) = 1 for |u| < 1, else 0.
    Simplest but causes discontinuous predictions.
    """
    return (np.abs(u) < 1).astype(float)

# Visualize the weight functions
u = np.linspace(-1.5, 1.5, 500)

fig, ax = plt.subplots(figsize=(10, 6))
ax.plot(u, tricube_weight(u), 'b-', lw=2.5, label='Tri-cube (LOESS standard)')
ax.plot(u, epanechnikov_weight(u), 'g--', lw=2, label='Epanechnikov')
ax.plot(u, gaussian_weight(u), 'r-.', lw=2, label='Gaussian (σ=1/3)')
ax.plot(u, uniform_weight(u), 'm:', lw=2, label='Uniform (box)')

ax.axhline(0, color='gray', lw=0.5)
ax.axvline(0, color='gray', lw=0.5)
ax.axvline(-1, color='gray', lw=0.5, ls='--', alpha=0.5)
ax.axvline(1, color='gray', lw=0.5, ls='--', alpha=0.5)

ax.set_xlabel('Normalized distance u = (x - x₀) / h', fontsize=12)
ax.set_ylabel('Weight W(u)', fontsize=12)
ax.set_title('Weight Functions for Local Regression', fontsize=14)
ax.legend(loc='upper right', fontsize=10)
ax.grid(True, alpha=0.3)
ax.set_xlim(-1.5, 1.5)
ax.set_ylim(-0.1, 1.1)

plt.tight_layout()
plt.show()
```

Bandwidth and the Span Parameter:
The key hyperparameter in LOESS is the bandwidth $h$ (also called the span or $\alpha$), which controls the size of the local neighborhood.
Two ways to specify bandwidth:
Fixed bandwidth $h$: The neighborhood has a fixed width. Points within distance $h$ of $x_0$ receive non-zero weight: $$w_i(x_0) = W\left( \frac{x_i - x_0}{h} \right)$$
Nearest-neighbor fraction $\alpha$ (LOESS standard): The neighborhood includes a fraction $\alpha$ of the data. The bandwidth $h(x_0)$ is the distance to the $k$-th nearest neighbor, where $k = \lfloor \alpha n \rfloor$: $$h(x_0) = |x_{(k)} - x_0|$$ where $x_{(k)}$ is the $k$-th nearest neighbor of $x_0$.
The nearest-neighbor approach is adaptive: neighborhoods are smaller in dense regions and larger in sparse regions, promoting consistent local sample sizes.
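A short sketch of the adaptive bandwidth (the helper name `nn_bandwidth` is ours): in dense regions the $k$-th nearest neighbor is close, so $h(x_0)$ shrinks; in sparse regions it stretches.

```python
import numpy as np

def nn_bandwidth(x0, x, span):
    """Distance from x0 to its k-th nearest neighbor, k = ceil(span * n)."""
    k = int(np.ceil(span * len(x)))
    return np.sort(np.abs(x - x0))[k - 1]

# Dense cluster near 0, sparse points beyond
x = np.array([0.0, 0.1, 0.2, 0.3, 2.0, 4.0, 6.0, 8.0, 9.0, 10.0])

# 30% span on n = 10 points -> k = 3 nearest neighbors
print(nn_bandwidth(0.15, x, span=0.3))  # dense region: small h (≈ 0.15)
print(nn_bandwidth(7.0, x, span=0.3))   # sparse region: larger h (= 2.0)
```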
| Span (α) | Neighborhood | Bias | Variance | Use Case |
|---|---|---|---|---|
| 0.1 - 0.2 | Very local | Low | High | Complex, wiggly patterns; large datasets |
| 0.3 - 0.5 | Moderate | Moderate | Moderate | General purpose; balanced tradeoff |
| 0.5 - 0.7 | Broad | Higher | Lower | Smoother trends; noisy data |
| 0.8 - 1.0 | Very broad | High | Very low | Captures only major trends; small datasets |
Choosing the Local Polynomial Degree:
The degree $p$ of the local polynomial affects both fitting and boundary behavior.
Local constant ($p = 0$): Nadaraya-Watson Estimator
The simplest case fits a horizontal line (constant) at each point: $$\hat{y}(x_0) = \frac{\sum_{i=1}^{n} w_i(x_0) y_i}{\sum_{i=1}^{n} w_i(x_0)}$$
This is simply a weighted average of nearby $y$ values. Fast to compute but suffers from boundary bias—the estimate is pulled toward the center of the local neighborhood.
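The Nadaraya-Watson estimator is short enough to write in a few lines. A sketch (assuming a tri-cube kernel with fixed bandwidth for simplicity); the second call illustrates the boundary bias, since the one-sided average of an increasing function overshoots the true value at the edge:

```python
import numpy as np

def nadaraya_watson(x0, x, y, h):
    """Local constant fit: kernel-weighted average of nearby y values."""
    u = (x - x0) / h
    w = np.where(np.abs(u) < 1, (1 - np.abs(u)**3)**3, 0.0)  # tri-cube
    return np.sum(w * y) / np.sum(w)

x = np.linspace(0, 1, 11)
y = 3.0 * np.ones_like(x)                   # constant signal
print(nadaraya_watson(0.5, x, y, h=0.25))   # ≈ 3.0: averages reproduce constants

y2 = x.copy()                               # increasing signal, f(0) = 0
print(nadaraya_watson(0.0, x, y2, h=0.25))  # > 0: boundary bias pulls the estimate inward
```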
Local linear ($p = 1$): The Standard Choice
Fits a line at each point. The key advantage: automatic boundary correction. At boundaries, the local line automatically adjusts its intercept to minimize bias.
Local quadratic ($p = 2$):
Fits a parabola at each point. Better captures curvature in the true function, but requires more data and is more variable.
Theory suggests that odd polynomial degrees ($p = 1, 3, 5, \ldots$) provide better bias-variance tradeoffs than even degrees at the same bandwidth. This is because odd-degree polynomials can correct for asymmetric data distributions more effectively. For this reason, $p = 1$ is almost universally preferred over $p = 0$ or $p = 2$.
Boundary Bias Illustrated:
Consider estimating $f(x)$ at $x_0$ near the left boundary. With local constant fitting: the neighborhood is one-sided (there are no data to the left of $x_0$), so the weighted average is dominated by points to the right. If $f$ is increasing there, the estimate overshoots $f(x_0)$, and the bias degrades from $O(h^2)$ to $O(h)$.

With local linear fitting: the slope term absorbs the asymmetry of the neighborhood. The fitted line tilts to match the local trend, so the intercept remains an accurate estimate of $f(x_0)$ and the interior $O(h^2)$ bias rate is preserved at the boundary.
This boundary correction is one of the most important reasons to use local linear regression over kernel smoothing with local constant fits.
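A quick numerical check of this claim (a self-contained sketch; the helper `local_poly_at` is ours): on noise-free linear data, the local linear fit recovers the boundary value exactly, while the local constant fit is pulled toward the interior.

```python
import numpy as np

def local_poly_at(x0, x, y, h, degree):
    """Local polynomial fit at x0 with tri-cube weights; returns y_hat(x0)."""
    u = np.abs(x - x0) / h
    w = np.where(u < 1, (1 - u**3)**3, 0.0)
    X = np.column_stack([(x - x0)**p for p in range(degree + 1)])
    beta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))
    return beta[0]

x = np.linspace(0, 1, 21)
y = 2.0 * x                       # noise-free linear truth, f(0) = 0

est_const = local_poly_at(0.0, x, y, h=0.3, degree=0)  # biased: pulled above 0
est_lin = local_poly_at(0.0, x, y, h=0.3, degree=1)    # exact at the boundary
print(est_const, est_lin)
```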
Algorithm: LOESS with Local Linear Regression
Given training data $(x_1, y_1), \ldots, (x_n, y_n)$, span parameter $\alpha$, and evaluation points $x_0^{(1)}, \ldots, x_0^{(m)}$:
For each evaluation point $x_0$: (1) compute the distances $|x_i - x_0|$ to all training points; (2) set the bandwidth $h(x_0)$ to the $k$-th smallest distance, where $k = \lceil \alpha n \rceil$; (3) compute tri-cube weights $w_i = W((x_i - x_0)/h)$; (4) solve the weighted least squares problem for a line centered at $x_0$; (5) take the intercept $\hat{\beta}_0$ as the prediction $\hat{y}(x_0)$.
```python
import numpy as np
from typing import Tuple, Optional

def tricube(u: np.ndarray) -> np.ndarray:
    """Tri-cube weight function."""
    u = np.asarray(u)
    w = np.zeros_like(u, dtype=float)
    mask = np.abs(u) < 1
    w[mask] = (1 - np.abs(u[mask])**3)**3
    return w

def loess_fit(x: np.ndarray,
              y: np.ndarray,
              x_eval: Optional[np.ndarray] = None,
              span: float = 0.75,
              degree: int = 1) -> Tuple[np.ndarray, np.ndarray]:
    """
    LOESS (Locally Estimated Scatterplot Smoothing).

    Parameters:
        x: Input values (n,)
        y: Target values (n,)
        x_eval: Points at which to evaluate (m,). If None, use x.
        span: Fraction of data to use in each local fit (0 < span <= 1)
        degree: Local polynomial degree (0, 1, or 2)

    Returns:
        x_eval: Evaluation points
        y_hat: Fitted values at evaluation points
    """
    x = np.asarray(x).flatten()
    y = np.asarray(y).flatten()
    n = len(x)

    if x_eval is None:
        x_eval = x.copy()
    else:
        x_eval = np.asarray(x_eval).flatten()

    m = len(x_eval)
    y_hat = np.zeros(m)

    # Number of nearest neighbors
    k = int(np.ceil(span * n))
    k = max(k, degree + 1)  # Need at least degree+1 points for a fit
    k = min(k, n)           # Can't use more points than we have

    for j, x0 in enumerate(x_eval):
        # Step 1: Find distances to all points
        distances = np.abs(x - x0)

        # Step 2: Find the k-th smallest distance (bandwidth)
        sorted_distances = np.sort(distances)
        h = sorted_distances[k - 1]

        # Prevent zero bandwidth
        if h < 1e-10:
            h = 1e-10

        # Step 3: Compute weights
        u = distances / h
        w = tricube(u)

        # Step 4: Local polynomial fit
        # Build design matrix centered at x0
        X_local = np.column_stack([(x - x0)**p for p in range(degree + 1)])

        # Weighted least squares: (X'WX)^{-1} X'Wy
        W = np.diag(w)
        XtW = X_local.T @ W
        XtWX = XtW @ X_local
        XtWy = XtW @ y

        # Solve the normal equations
        try:
            beta = np.linalg.solve(XtWX, XtWy)
        except np.linalg.LinAlgError:
            # Singular matrix - fall back to least squares on the
            # sqrt-weighted system (equivalent to weighted least squares)
            sw = np.sqrt(w)
            beta = np.linalg.lstsq(X_local * sw[:, np.newaxis],
                                   y * sw, rcond=None)[0]

        # Step 5: Prediction is the intercept (since we centered at x0)
        y_hat[j] = beta[0]

    return x_eval, y_hat

# =============================================================================
# DEMONSTRATION
# =============================================================================
np.random.seed(42)

# Generate data with a complex pattern
n = 100
x = np.sort(np.random.uniform(0, 2*np.pi, n))
y_true = np.sin(x) + 0.5 * np.sin(3*x)    # True function
y = y_true + np.random.normal(0, 0.3, n)  # Add noise

# Fit LOESS with different spans
spans = [0.2, 0.4, 0.7]
x_fine = np.linspace(0, 2*np.pi, 200)

print("LOESS Demonstration")
print("=" * 50)
print(f"{'Span':>8} | {'Mean Squared Error':>20}")
print("-" * 50)

for span in spans:
    _, y_hat = loess_fit(x, y, x_eval=x_fine, span=span, degree=1)
    _, y_hat_train = loess_fit(x, y, span=span, degree=1)
    mse = np.mean((y - y_hat_train)**2)
    print(f"{span:>8.2f} | {mse:>20.6f}")
```

LOESS has O(n²) complexity for fitting all training points, since each of the $n$ predictions requires examining all $n$ data points. For large datasets, consider: (1) evaluating at fewer points than training points; (2) using approximate nearest neighbors; (3) switching to kernel smoothing with fixed bandwidth; or (4) using binning/interpolation strategies.
The Outlier Problem:
Standard LOESS uses least squares, which is notoriously sensitive to outliers. A single aberrant point can substantially distort the local fit, affecting predictions in its neighborhood.
Cleveland's Robustifying Procedure:
Cleveland (1979) proposed an iterative reweighting scheme that down-weights observations with large residuals: (1) fit LOESS as usual and compute residuals $r_i = y_i - \hat{y}(x_i)$; (2) let $s = \text{median}(|r_i|)$; (3) compute robustness weights $\delta_i = B(r_i / (6s))$ using the bisquare function; (4) refit LOESS, multiplying each locality weight by $\delta_i$; (5) repeat a few times (typically 2-3 iterations).
The Bisquare Weight Function:
$$B(u) = \begin{cases} (1 - u^2)^2 & \text{if } |u| < 1 \\ 0 & \text{otherwise} \end{cases}$$
This function: is smooth on its support, gives full weight to observations with small scaled residuals, decreases as $|u|$ grows, and assigns exactly zero weight when $|u| \geq 1$, i.e., when a residual exceeds six times the median absolute residual.
The factor of 6 sets the cutoff at six times the median absolute residual. Under Gaussian errors the median absolute residual is about $0.67\sigma$, so the cutoff is roughly $4\sigma$: essentially all well-behaved observations keep non-zero robustness weights, while gross outliers are discarded entirely.
```python
import numpy as np
from typing import Tuple

def bisquare(u: np.ndarray) -> np.ndarray:
    """Bisquare weight function for robustifying."""
    u = np.asarray(u)
    w = np.zeros_like(u, dtype=float)
    mask = np.abs(u) < 1
    w[mask] = (1 - u[mask]**2)**2
    return w

def robust_loess(x: np.ndarray,
                 y: np.ndarray,
                 span: float = 0.75,
                 degree: int = 1,
                 robustifying_iterations: int = 3) -> Tuple[np.ndarray, np.ndarray]:
    """
    Robust LOESS with iterative reweighting for outlier resistance.

    Parameters:
        x: Input values
        y: Target values
        span: Fraction of data used in local fits
        degree: Local polynomial degree
        robustifying_iterations: Number of robustifying iterations (0 = none)

    Returns:
        x: Input values (sorted)
        y_hat: Fitted values
    """
    x = np.asarray(x).flatten()
    y = np.asarray(y).flatten()
    n = len(x)

    # Sort by x for consistency
    order = np.argsort(x)
    x = x[order]
    y = y[order]

    k = int(np.ceil(span * n))
    k = max(k, degree + 1)
    k = min(k, n)

    # Initialize robustness weights to 1
    delta = np.ones(n)

    for iteration in range(robustifying_iterations + 1):
        y_hat = np.zeros(n)

        for j in range(n):
            x0 = x[j]

            # Distances and bandwidth
            distances = np.abs(x - x0)
            sorted_distances = np.sort(distances)
            h = max(sorted_distances[k - 1], 1e-10)

            # Combined weights: locality (tri-cube) * robustness
            u = (x - x0) / h
            w = (1 - np.abs(u)**3)**3 * (np.abs(u) < 1)
            w = w * delta  # Apply robustness weights

            # Local polynomial fit
            X_local = np.column_stack([(x - x0)**p for p in range(degree + 1)])
            try:
                W = np.diag(w)
                XtWX = X_local.T @ W @ X_local
                XtWy = X_local.T @ W @ y
                beta = np.linalg.solve(XtWX, XtWy)
            except np.linalg.LinAlgError:
                # Fallback for singular cases: weighted mean
                beta = np.zeros(degree + 1)
                beta[0] = np.average(y, weights=w + 1e-10)

            y_hat[j] = beta[0]

        # Compute robustness weights for next iteration
        if iteration < robustifying_iterations:
            residuals = y - y_hat
            s = np.median(np.abs(residuals))  # Median absolute residual
            if s > 1e-10:
                u = residuals / (6.0 * s)
                delta = bisquare(u)
            else:
                break  # Perfect fit, no need to continue

    return x, y_hat

# Demonstration with outliers
np.random.seed(42)
n = 80
x = np.sort(np.random.uniform(0, 4, n))
y_true = 2 * np.sin(x)
y = y_true + np.random.normal(0, 0.3, n)

# Add some outliers
outlier_indices = [10, 25, 50, 65]
y[outlier_indices] += np.array([3, -4, 5, -3.5])

# Compare standard and robust LOESS
_, y_standard = robust_loess(x, y, span=0.3, robustifying_iterations=0)
_, y_robust = robust_loess(x, y, span=0.3, robustifying_iterations=3)

mse_standard = np.mean((y_standard - y_true)**2)
mse_robust = np.mean((y_robust - y_true)**2)

print("Robust LOESS vs Standard LOESS (with 4 outliers)")
print("=" * 50)
print(f"Standard LOESS MSE: {mse_standard:.4f}")
print(f"Robust LOESS MSE:   {mse_robust:.4f}")
print(f"Improvement: {100*(mse_standard - mse_robust)/mse_standard:.1f}%")
```

Use robustifying iterations when: (1) Data may contain outliers or recording errors; (2) The noise distribution is heavy-tailed; (3) You want a 'safe' default that works well across scenarios. The computational overhead is typically 2-3x, but the protection against outliers is often worth it.
Bias-Variance Tradeoff in Local Regression:
For local linear regression with bandwidth $h$, the asymptotic bias and variance at a point $x_0$ are:
Bias: $$\text{Bias}[\hat{f}(x_0)] \approx \frac{h^2}{2} f''(x_0) \mu_2$$
where $\mu_2 = \int u^2 K(u) du$ for kernel $K$.
Variance: $$\text{Var}[\hat{f}(x_0)] \approx \frac{\sigma^2 \nu_0}{n h \, p(x_0)}$$
where $p(x_0)$ is the density of $x$ at $x_0$, and $\nu_0 = \int K(u)^2 \, du$.
Key insights: bias grows like $h^2$ and scales with the curvature $f''(x_0)$, so oversmoothing hurts most where the function bends sharply; variance shrinks like $1/(nh)$ and inflates where data are sparse (small $p(x_0)$); and the optimal bandwidth must balance the two, shrinking as $n$ grows.
Optimal Bandwidth:
Minimizing the asymptotic mean squared error (MSE) yields the optimal bandwidth:
$$h_{\text{opt}}(x_0) \propto \left( \frac{\sigma^2}{n p(x_0) [f''(x_0)]^2} \right)^{1/5}$$
Convergence rate: $$\text{MSE}[\hat{f}(x_0)] = O(n^{-4/5})$$
This is slower than the parametric rate of $O(n^{-1})$, but faster than the $O(n^{-2/3})$ rate of the local constant estimator.
Practical implication: The $n^{-4/5}$ rate means you need more data to achieve the same accuracy as a correctly specified parametric model—the price of flexibility.
| Method | Convergence Rate | Required n for ε=0.01 error |
|---|---|---|
| Parametric (correctly specified) | O(n⁻¹) | ~100 |
| Local linear regression | O(n⁻⁴ᐟ⁵) | ~3,200 |
| Local constant (Nadaraya-Watson) | O(n⁻²ᐟ³) | ~10,000 |
| Nonparametric in d dimensions | O(n⁻⁴ᐟ⁽⁴⁺ᵈ⁾) | Explodes with d |
LOESS has effective degrees of freedom defined as $\text{tr}(\mathbf{S})$ where $\mathbf{S}$ is the smoothing matrix ($\hat{\mathbf{y}} = \mathbf{S} \mathbf{y}$). This quantifies the 'complexity' of the fit and is useful for model comparison. Smaller bandwidth → more degrees of freedom → more flexible fit.
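The effective degrees of freedom can be computed by assembling the smoother matrix row by row from the equivalent kernel weights (a sketch assuming the tri-cube/nearest-neighbor conventions above; `loess_smoother_matrix` is our own helper):

```python
import numpy as np

def loess_smoother_matrix(x, span=0.5, degree=1):
    """Build S such that y_hat = S @ y for a LOESS fit at the training points."""
    n = len(x)
    k = max(int(np.ceil(span * n)), degree + 1)
    S = np.zeros((n, n))
    for j, x0 in enumerate(x):
        d = np.abs(x - x0)
        h = max(np.sort(d)[k - 1], 1e-10)
        u = d / h
        w = np.where(u < 1, (1 - u**3)**3, 0.0)  # tri-cube weights
        X = np.column_stack([(x - x0)**p for p in range(degree + 1)])
        # Row j of S holds the equivalent kernel weights s_i(x0)
        A = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ np.diag(w))
        S[j] = A[0]
    return S

x = np.linspace(0, 1, 40)
for span in (0.2, 0.5, 0.9):
    S = loess_smoother_matrix(x, span=span)
    print(f"span={span}: effective df = {np.trace(S):.2f}")
```

Running this shows the trace falling as the span widens, matching the intuition that broader neighborhoods give a stiffer, lower-complexity fit.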
We've established the complete framework for local regression, from intuition through theory to a working implementation: weighted local fits, the tri-cube weight function, span selection, polynomial degree, robustness iterations, and the bias-variance analysis.
What's Next:
LOESS is one approach to local fitting. The next page explores kernel smoothing—a closely related technique that uses fixed bandwidth and has a cleaner theoretical foundation. We'll see how kernel smoothers relate to LOESS and when each is preferred.
You now understand local regression from theory to implementation. LOESS provides a powerful, flexible tool for exploring nonlinear relationships without parametric assumptions. Next, we'll explore kernel smoothing and see how it complements and extends these ideas.