Everything we've learned about nonparametric regression works beautifully in one or two dimensions. But a profound challenge emerges as the number of predictors grows: the curse of dimensionality.

This evocative term, coined by Richard Bellman in 1961, describes a collection of phenomena that make high-dimensional spaces behave in ways that defy our low-dimensional intuition. In the context of nonparametric regression, the curse manifests as an exponential degradation in estimation quality—or equivalently, an exponential increase in data requirements.

Understanding the curse is not just academic. It fundamentally shapes which methods are practical for which problems, explains why modern machine learning uses regularization and structure so heavily, and illuminates the tradeoffs between parametric and nonparametric approaches.
By the end of this page, you will understand: (1) What the curse of dimensionality means precisely; (2) Why local neighborhoods become 'global' in high dimensions; (3) The mathematical foundation—convergence rates and sample complexity; (4) Geometric intuitions for high-dimensional phenomena; (5) How the curse limits nonparametric regression; (6) Strategies for mitigating the curse.
The Core Issue:

Nonparametric regression relies on local averaging—estimating the function at a point by averaging nearby observations. For this to work, we need enough data points in every local neighborhood to get reliable averages.

In low dimensions, a 'local' neighborhood is genuinely small relative to the data domain. But as dimensions increase, 'local' neighborhoods must grow dramatically to contain the same number of points. Eventually, 'local' neighborhoods span the entire dataset—and local averaging becomes global averaging, which cannot capture local structure.

The Volume Growth Problem:

Consider the $d$-dimensional unit hypercube $[0, 1]^d$. A 'local' neighborhood is a ball of radius $r$ centered at some point. The volume of this ball is:

$$V_d(r) = \frac{\pi^{d/2}}{\Gamma(d/2 + 1)} r^d$$

As $d$ increases, the fraction of the hypercube occupied by any fixed ball shrinks exponentially.
The calculation below makes this concrete: how large must the radius be for a ball to capture just 1% of the unit cube's volume?

```python
import numpy as np
from scipy.special import gamma

def ball_volume(r: float, d: int) -> float:
    """Volume of a d-dimensional ball of radius r."""
    return (np.pi ** (d / 2) / gamma(d / 2 + 1)) * r ** d

def cube_volume(d: int) -> float:
    """Volume of the d-dimensional unit cube [0,1]^d."""
    return 1.0  # Always 1

def fraction_in_ball(r: float, d: int) -> float:
    """Approximate fraction of unit-cube volume covered by a ball of radius r.

    For small r the ball lies inside the cube, so this is just
    ball_volume / cube_volume; we cap at 1 because the ball can spill outside.
    """
    return min(1.0, ball_volume(r, d))

# How far a "local" neighborhood must reach to capture 1% of the volume
print("The Emptiness of High-Dimensional Space")
print("=" * 60)
print("\nTo capture 1% of volume (and ~1% of uniformly distributed data):")
print("-" * 60)
print(f"{'Dimension d':<15} | {'Required radius r':<20} | {'r as % of [0,1]'}")
print("-" * 60)

for d in [1, 2, 3, 5, 10, 20, 50, 100]:
    # Solve V_d(r) = 0.01 for r:  r = (0.01 * Γ(d/2+1) / π^(d/2))^(1/d)
    target_fraction = 0.01
    r = (target_fraction * gamma(d / 2 + 1) / np.pi ** (d / 2)) ** (1 / d)
    print(f"{d:<15} | {r:<20.4f} | {100 * r:<.1f}%")

print("\n⚠ By d = 20 the required radius is ≈ 0.95, and for d ≥ 50 it exceeds 1:")
print("  no ball that fits inside the cube is large enough. 'Local' becomes 'global'.")
```

In $d = 10$ dimensions, to capture just 1% of the volume (and roughly 1% of uniformly distributed data points), you need a ball with radius $r \approx 0.57$, reaching more than half the way along each axis. This is hardly 'local'! The neighborhood stretches across a large fraction of the space.
The Mathematical Foundation:

For kernel regression (or any 'reasonable' nonparametric estimator) in $d$ dimensions estimating a function $f$ with $s$ continuous derivatives, the optimal mean integrated squared error (MISE) is:

$$\text{MISE}_{\text{opt}} = O\left( n^{-\frac{2s}{2s + d}} \right)$$

This is the Stone (1982) optimal rate—no estimator can do better without additional assumptions.

Interpretation:

- Numerator $2s$: smoothness of the function (more derivatives = faster convergence)
- Denominator $2s + d$: smoothness plus dimension

As $d$ grows relative to $s$, the rate degrades toward $n^0 = O(1)$—meaning the error barely decreases with more data!
| Dimension d | Convergence Rate | Approx. n for MSE ≈ 0.01 | Ratio vs d=1 |
|---|---|---|---|
| 1 | $O(n^{-4/5}) = O(n^{-0.80})$ | ~100 | 1x |
| 2 | $O(n^{-2/3}) = O(n^{-0.67})$ | ~1,000 | 10x |
| 3 | $O(n^{-4/7}) \approx O(n^{-0.57})$ | ~6,000 | 60x |
| 5 | $O(n^{-4/9}) \approx O(n^{-0.44})$ | ~100,000 | 1,000x |
| 10 | $O(n^{-2/7}) \approx O(n^{-0.29})$ | ~10 million | 100,000x |
| 20 | $O(n^{-2/12}) \approx O(n^{-0.17})$ | ~$10^{12}$ | ~$10^{10}$x |
Sample Complexity Explosion:

To halve the MISE, you need to increase the sample size by a factor of $2^{(2s+d)/2s}$.

For $s = 2$ (a twice-differentiable function):

- $d = 1$: need $2^{5/4} \approx 2.4$ times more data to halve the error
- $d = 5$: need $2^{9/4} \approx 4.8$ times more data
- $d = 10$: need $2^{14/4} \approx 11.3$ times more data
- $d = 20$: need $2^{24/4} = 64$ times more data

Getting meaningful improvement in high dimensions requires exponentially more data.
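These figures follow directly from the rate exponent $2s/(2s+d)$. The short sketch below recomputes them for $s = 2$, ignoring all constants, so the required-$n$ values are order-of-magnitude only (in the same spirit as the table above).

```python
# Stone-rate bookkeeping for s = 2 (twice-differentiable) functions, constants ignored.
for d in [1, 2, 3, 5, 10, 20, 100]:
    rate = 4 / (4 + d)                 # MISE ~ n^(-rate)
    n_needed = 0.01 ** (-1 / rate)     # n with n^(-rate) = 0.01, up to constants
    halve_factor = 2 ** (1 / rate)     # data multiplier needed to cut MISE in half
    print(f"d={d:>3}: MISE ~ n^(-{rate:.2f}); "
          f"n for MISE ≈ 0.01: {n_needed:.1e}; "
          f"x{halve_factor:.3g} data to halve MISE")
```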
The simulation below measures this empirically: Nadaraya-Watson regression on a smooth $d$-dimensional function, with test MSE tracked as $n$ grows.

```python
import numpy as np
from typing import List, Tuple

def simulate_kernel_regression_error(
    d: int,
    n_values: List[int],
    n_trials: int = 20,
    noise_std: float = 0.3
) -> Tuple[np.ndarray, np.ndarray]:
    """
    Simulate kernel regression in d dimensions and measure MSE.

    True function: f(x) = sum of sines (smooth, d-dimensional)
    """
    mse_means = []
    mse_stds = []

    def f_true(X):
        """Smooth additive function for testing."""
        return np.sum(np.sin(2 * X), axis=1)

    def kernel_regression_predict(X_train, y_train, X_test, h):
        """Simple Nadaraya-Watson in d dimensions."""
        n_test = len(X_test)
        y_pred = np.zeros(n_test)
        for i in range(n_test):
            # Euclidean distances
            diff = X_train - X_test[i]
            distances = np.sqrt(np.sum(diff**2, axis=1))
            # Gaussian kernel
            weights = np.exp(-distances**2 / (2 * h**2))
            weight_sum = np.sum(weights)
            if weight_sum > 1e-10:
                y_pred[i] = np.sum(weights * y_train) / weight_sum
            else:
                y_pred[i] = np.mean(y_train)
        return y_pred

    for n in n_values:
        trial_mses = []
        for trial in range(n_trials):
            # Generate data
            X_train = np.random.uniform(0, 2*np.pi, (n, d))
            y_train = f_true(X_train) + np.random.normal(0, noise_std, n)

            # Test points
            n_test = 100
            X_test = np.random.uniform(0, 2*np.pi, (n_test, d))
            y_test_true = f_true(X_test)

            # Rule-of-thumb bandwidth, scaled with dimension (optimal-rate scaling)
            sigma_x = 2*np.pi / np.sqrt(12)  # std of uniform on [0, 2π]
            h = sigma_x * n**(-1/(4 + d))

            # Predict
            y_pred = kernel_regression_predict(X_train, y_train, X_test, h)
            mse = np.mean((y_pred - y_test_true)**2)
            trial_mses.append(mse)

        mse_means.append(np.mean(trial_mses))
        mse_stds.append(np.std(trial_mses))

    return np.array(mse_means), np.array(mse_stds)

# =============================================================================
# Run simulation for different dimensions
# =============================================================================
np.random.seed(42)

dimensions = [1, 2, 5, 10]
n_values = [50, 100, 200, 500, 1000, 2000]

print("Kernel Regression MSE vs Sample Size and Dimension")
print("=" * 70)
print(f"{'n':<8}", end="")
for d in dimensions:
    print(f"{'d=' + str(d):<12}", end="")
print()
print("-" * 70)

results = {}
for d in dimensions:
    mse_means, mse_stds = simulate_kernel_regression_error(d, n_values, n_trials=10)
    results[d] = mse_means

for i, n in enumerate(n_values):
    print(f"{n:<8}", end="")
    for d in dimensions:
        print(f"{results[d][i]:<12.4f}", end="")
    print()

print("\n⚠ Note how MSE barely improves with n in high dimensions!")
print("  This is the curse of dimensionality in action.")
```

Counter-Intuitive Phenomena:

High-dimensional geometry violates our 3D intuitions in profound ways. Understanding these phenomena helps explain why nonparametric methods struggle.

1. Points Concentrate Near the Surface

In a high-dimensional ball, almost all the volume is near the surface. For a ball of radius $r$, the fraction of volume within radius $(1-\epsilon)r$ is:

$$\frac{V_d((1-\epsilon)r)}{V_d(r)} = (1-\epsilon)^d \to 0 \quad \text{as } d \to \infty$$

For $d = 100$ and $\epsilon = 0.1$: $(0.9)^{100} \approx 0.00003$.

Implication: in high dimensions, all data points are roughly equidistant from the center. The concept of 'distance from center' loses meaning.
2. All Points Are Far From Each Other

The expected distance between two random points in the unit hypercube grows like $\sqrt{d/6}$. Meanwhile, a standard heuristic for the typical nearest-neighbor distance (with $n$ fixed) gives:

$$\mathbb{E}[\text{NN distance}] \sim \frac{d^{1/d}}{n^{1/d}} \to 1 \quad \text{as } d \to \infty$$

For large $d$, the nearest neighbor is almost as far away as the average point!

3. Corners Dominate

In the unit hypercube $[0,1]^d$, most of the volume sits in the corners: the inscribed ball of radius $1/2$ occupies a vanishing fraction of the cube's volume as $d$ grows, while the main diagonal has length $\sqrt{d}$. Uniform data therefore spreads out toward the $2^d$ corners rather than clustering near the center.

4. Random Vectors Are Nearly Orthogonal

The angle between two random unit vectors in $\mathbb{R}^d$ concentrates around $90°$ as $d \to \infty$; their inner product is approximately $N(0, 1/d)$ distributed (see the sketch below).
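Phenomenon 4 is easy to verify numerically. The sketch below (plain NumPy; random directions drawn as Gaussian vectors normalized to unit length) compares the spread of the inner products with the predicted $1/\sqrt{d}$:

```python
import numpy as np

rng = np.random.default_rng(0)
n_pairs = 5000

for d in [2, 10, 100, 1000]:
    # Random directions: normalize standard Gaussian vectors to unit length
    u = rng.standard_normal((n_pairs, d))
    v = rng.standard_normal((n_pairs, d))
    u /= np.linalg.norm(u, axis=1, keepdims=True)
    v /= np.linalg.norm(v, axis=1, keepdims=True)

    dots = np.sum(u * v, axis=1)                       # inner products
    angles = np.degrees(np.arccos(np.clip(dots, -1, 1)))
    print(f"d={d:>4}: std of <u,v> = {dots.std():.4f} "
          f"(1/sqrt(d) = {1/np.sqrt(d):.4f}), mean angle = {angles.mean():.1f}°")
```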
The simulations below check phenomena 1 and 2 directly: nearest-neighbor versus average pairwise distances, and the volume of a thin outer shell.

```python
import numpy as np

def demonstrate_distance_concentration(dimensions: list, n_points: int = 1000):
    """
    Show how distances concentrate in high dimensions.
    """
    print("Distance Concentration in High Dimensions")
    print("=" * 70)
    print("Random points in unit hypercube [0,1]^d")
    print("-" * 70)
    print(f"{'Dim d':<8} | {'Mean NN dist':<15} | {'Mean all-pairs dist':<20} | {'Ratio':<10}")
    print("-" * 70)

    for d in dimensions:
        # Generate random points
        X = np.random.uniform(0, 1, (n_points, d))

        # Compute all pairwise distances
        all_distances = []
        nn_distances = []

        for i in range(min(n_points, 500)):  # Limit for speed
            dists = np.sqrt(np.sum((X - X[i])**2, axis=1))
            dists[i] = np.inf  # Exclude self
            nn_distances.append(np.min(dists))
            all_distances.extend(dists[dists < np.inf])

        mean_nn = np.mean(nn_distances)
        mean_all = np.mean(all_distances)
        ratio = mean_nn / mean_all

        print(f"{d:<8} | {mean_nn:<15.4f} | {mean_all:<20.4f} | {ratio:<10.4f}")

    print("\n⚠ As d increases, nearest neighbors become almost as far as average!")
    print("  'Nearest' loses meaning—all points are effectively far from each other.")

def demonstrate_volume_in_shell(dimensions: list):
    """
    Show how volume concentrates in thin shells in high dimensions.
    """
    print("\n\nVolume Concentration in Thin Shells")
    print("=" * 70)
    print("Fraction of unit ball volume in outer 10% shell (r ∈ [0.9, 1.0])")
    print("-" * 70)

    for d in dimensions:
        inner_frac = 0.9 ** d        # Volume within r=0.9 as a fraction of r=1.0
        shell_frac = 1 - inner_frac  # Volume in the shell
        print(f"d = {d:>3}: {100*shell_frac:>6.2f}% of volume in outer 10% shell")

    print("\n⚠ In high dimensions, almost ALL volume is near the surface!")

# Run demonstrations
np.random.seed(42)
demonstrate_distance_concentration([1, 2, 5, 10, 20, 50, 100])
demonstrate_volume_in_shell([1, 2, 5, 10, 20, 50, 100, 500])
```

These geometric facts explain why kernel smoothing fails in high dimensions. If all points are roughly equidistant, kernel weights become nearly uniform, and local averaging becomes global averaging. We lose the ability to capture local structure—exactly what nonparametric regression is supposed to do.
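To make the last point concrete, here is a small sketch of how kernel weights flatten out. The setup is an illustrative assumption, not part of the analysis above: uniform data in $[0,1]^d$, a Gaussian kernel, and the common heuristic of setting the bandwidth to the median distance from the query to the training points. It reports the effective sample size of the weights, $(\sum_i w_i)^2 / \sum_i w_i^2$, as a fraction of $n$; values near 100% mean the 'local' average is essentially a uniform global average.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

for d in [1, 2, 5, 10, 50, 100]:
    X = rng.uniform(0, 1, (n, d))
    x0 = rng.uniform(0, 1, d)                      # query point
    dist = np.linalg.norm(X - x0, axis=1)
    h = np.median(dist)                            # median-distance bandwidth heuristic
    w = np.exp(-dist**2 / (2 * h**2))              # Gaussian kernel weights
    ess_frac = (w.sum()**2 / np.sum(w**2)) / n     # effective sample size / n
    print(f"d={d:>3}: effective sample size = {100*ess_frac:5.1f}% of n")
```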
The Death of Local Averaging:

In $d$ dimensions with $n$ data points, the typical distance from a point to its nearest neighbor scales as:

$$\rho_n \sim n^{-1/d}$$

For a kernel smoother with bandwidth $h$, we want $h$ small (to stay local) but large enough to include many points (for stable averaging).

The fundamental tension:

- In $d = 1$: points are spaced $\sim 1/n$ apart; a bandwidth $h \sim n^{-1/5}$ contains $\sim n \cdot h = n^{4/5}$ points. Plenty!
- In $d = 10$: points are spaced $\sim n^{-1/10}$ apart; the same bandwidth $h \sim n^{-1/5}$ contains only $\sim n \cdot h^{10} = n^{-1}$ points: essentially none on average.

To include a useful number of points as $d$ increases, the neighborhood must grow until it encompasses a large share of the entire dataset (the sketch below quantifies this).
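A back-of-the-envelope check of this tension, assuming $n$ points uniform in $[0,1]^d$ and ignoring boundary effects: the expected number of neighbors within distance $h$ of an interior query is roughly $n \cdot V_d(h)$.

```python
import numpy as np
from scipy.special import gamma

def ball_volume(r, d):
    """Volume of a d-dimensional ball of radius r."""
    return (np.pi ** (d / 2) / gamma(d / 2 + 1)) * r ** d

n, h = 1000, 0.2  # sample size and a bandwidth that is generous by 1-D standards
for d in [1, 2, 3, 5, 10, 20]:
    expected = n * ball_volume(h, d)  # expected neighbors, ignoring boundary effects
    print(f"d={d:>2}: ~{expected:10.4f} training points within distance {h}")
```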
Bandwidth Scaling in High Dimensions:

The optimal bandwidth for $d$-dimensional regression scales as:

$$h_{\text{opt}} \sim n^{-1/(2s + d)}$$

For $s = 2$ derivatives:

- $d = 1$: $h \sim n^{-1/5}$ (shrinks slowly)
- $d = 10$: $h \sim n^{-1/14} \approx n^{-0.07}$ (shrinks very slowly)
- $d = 100$: $h \sim n^{-1/104} \approx n^{-0.01}$ (barely shrinks at all)

With $n = 1000$ and $d = 100$: $h \approx 1000^{-0.01} \approx 0.93$.

The optimal bandwidth is nearly the entire data range! The 'local' fit becomes global.
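The same scaling in a few lines of code, again for $s = 2$ and treating the proportionality constant as 1, so the values are illustrative only:

```python
n, s = 1000, 2
for d in [1, 2, 5, 10, 20, 100]:
    h_opt = n ** (-1 / (2 * s + d))   # optimal-rate bandwidth, constant ignored
    print(f"d={d:>3}: h_opt ~ n^(-1/{2*s + d}) = {h_opt:.3f} of the data range")
```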
| Dimension d | Stone Rate | Required n | Practical? |
|---|---|---|---|
| 1 | $n^{-0.80}$ | ~100 | ✓ Easy |
| 2 | $n^{-0.67}$ | ~1,000 | ✓ Easy |
| 3 | $n^{-0.57}$ | ~10,000 | ✓ Feasible |
| 5 | $n^{-0.44}$ | ~100,000 | ⚠ Challenging |
| 10 | $n^{-0.29}$ | ~10 million | ✗ Very difficult |
| 20 | $n^{-0.17}$ | ~$10^{12}$ | ✗ Impossible |
| 100 | $n^{-0.04}$ | ~$10^{50}$ | ✗ Science fiction |
These sample requirements aren't artifacts of bad algorithms—they're optimal given the problem structure. No clever estimator can beat Stone's rate without additional assumptions (like smoothness structure, sparsity, or additive forms). The curse is a property of the problem, not the solution.
Escaping the Curse Requires Structure:

The curse of dimensionality is inescapable for truly arbitrary high-dimensional functions. But real-world data rarely exhibits arbitrary complexity. By exploiting structure, we can evade the curse.

1. Additive Models (GAMs):

Assume the function decomposes as:

$$f(x_1, \ldots, x_d) = \sum_{j=1}^{d} f_j(x_j)$$

Now we have $d$ one-dimensional problems, each with the favorable $n^{-4/5}$ rate. The curse is eliminated at the cost of assuming no interactions.

2. Projection Pursuit / Single-Index Models:

Assume:

$$f(\mathbf{x}) = g(\boldsymbol{\beta}^T \mathbf{x})$$

The function depends on $\mathbf{x}$ only through a single linear combination. Again, this reduces to 1D estimation plus direction estimation.
3. Variable Selection / Sparsity:

Assume only $k \ll d$ features actually matter. Then:

- The rate depends on $k$, not $d$: $n^{-4/(4+k)}$ for a twice-differentiable function of the relevant features
- Challenge: finding which $k$ features matter
- Methods: Lasso-based selection + nonparametric fitting

4. Low-Rank / Manifold Structure:

Assume the data lies on a low-dimensional manifold embedded in $\mathbb{R}^d$:

- If the manifold dimension is $m \ll d$, rates depend on $m$, not $d$
- Methods: locally linear embedding, Isomap, manifold-based regression

5. Regularization and Kernel Methods:

- RKHSs (reproducing kernel Hilbert spaces) impose smoothness
- SVMs and kernel ridge regression can achieve dimension-independent rates for specific function classes
- The kernel implicitly defines the 'allowed' function class
The demonstration below implements strategy 1: an additive model fit by backfitting with univariate kernel smoothers, compared against a full 10-dimensional kernel regression on the same data.

```python
import numpy as np

def additive_kernel_regression(
    X: np.ndarray,
    y: np.ndarray,
    X_test: np.ndarray,
    h: float = None
) -> np.ndarray:
    """
    Additive model via backfitting with univariate kernel smoothers.

    f(x_1, ..., x_d) = μ + f_1(x_1) + ... + f_d(x_d)

    This ESCAPES the curse of dimensionality by solving d one-dimensional
    problems instead of one d-dimensional problem.
    """
    n, d = X.shape
    n_test = X_test.shape[0]

    if h is None:
        h = 1.06 * np.std(X) * n**(-0.2)  # Rule of thumb for 1D

    def smooth_1d(x, y, x_new, h):
        """Nadaraya-Watson in 1D."""
        y_hat = np.zeros(len(x_new))
        for i, x0 in enumerate(x_new):
            u = (x - x0) / h
            w = np.exp(-u**2 / 2)  # Gaussian kernel
            if np.sum(w) > 1e-10:
                y_hat[i] = np.sum(w * y) / np.sum(w)
            else:
                y_hat[i] = np.mean(y)
        return y_hat

    # Initialize: overall mean
    mu = np.mean(y)
    f_components = [np.zeros(n) for _ in range(d)]

    # Backfitting iterations
    max_iter = 10
    for iteration in range(max_iter):
        for j in range(d):
            # Partial residual: y minus all other components
            partial_residual = y - mu - sum(f for k, f in enumerate(f_components) if k != j)
            # Smooth partial residual against x_j
            f_components[j] = smooth_1d(X[:, j], partial_residual, X[:, j], h)
            # Center the component
            f_components[j] -= np.mean(f_components[j])

    # Predict on test set
    y_test = np.full(n_test, mu)
    for j in range(d):
        # The fitted f_j is only known at the training points X[:, j],
        # so re-smooth the partial residual to evaluate it at the test points
        partial_residual = y - mu - sum(f for k, f in enumerate(f_components) if k != j)
        y_test += smooth_1d(X[:, j], partial_residual, X_test[:, j], h)

    return y_test

# =============================================================================
# Demonstration: Additive vs Full Nonparametric in High Dimensions
# =============================================================================
np.random.seed(42)

d = 10        # 10 dimensions - challenging for full nonparametric
n_train = 500
n_test = 100

# True additive function (sum of sines)
f_true = lambda X: np.sum(np.sin(X), axis=1)

# Generate data
X_train = np.random.uniform(0, 2*np.pi, (n_train, d))
y_train = f_true(X_train) + np.random.normal(0, 0.3, n_train)
X_test = np.random.uniform(0, 2*np.pi, (n_test, d))
y_test_true = f_true(X_test)

# Additive model (exploits structure)
y_pred_additive = additive_kernel_regression(X_train, y_train, X_test, h=0.5)

# Full kernel regression (doesn't exploit structure)
def full_kernel_regression(X_train, y_train, X_test, h):
    y_pred = np.zeros(len(X_test))
    for i in range(len(X_test)):
        diff = X_train - X_test[i]
        dist = np.sqrt(np.sum(diff**2, axis=1))
        w = np.exp(-dist**2 / (2*h**2))
        if np.sum(w) > 1e-10:
            y_pred[i] = np.sum(w * y_train) / np.sum(w)
        else:
            y_pred[i] = np.mean(y_train)
    return y_pred

# Use a reasonable bandwidth for d=10
h_full = 2.0 * n_train**(-1/14)  # Optimal rate scaling for d=10
y_pred_full = full_kernel_regression(X_train, y_train, X_test, h_full)

mse_additive = np.mean((y_pred_additive - y_test_true)**2)
mse_full = np.mean((y_pred_full - y_test_true)**2)

print("Escaping the Curse via Additive Structure (d=10)")
print("=" * 60)
print(f"Additive Model MSE: {mse_additive:.4f}")
print(f"Full Kernel MSE:    {mse_full:.4f}")
print(f"Improvement ratio:  {mse_full / mse_additive:.1f}x")
print("\n⚠ When the truth IS additive, the additive model wins dramatically!")
print("  It solves 10 easy 1D problems instead of one impossible 10D problem.")
```

The curse of dimensionality is escapable if you can identify and exploit structure.
Additive models, sparse regression, low-rank approximations, and manifold learning all work by reducing effective dimensionality. The key is matching your method to your data's true structure—or having enough domain knowledge to impose appropriate structure.
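For comparison, strategy 2 (a single-index model) can be sketched just as simply. The version below is a minimal illustration on simulated data, not a full estimator: it uses the ordinary least-squares coefficient vector as a crude estimate of the index direction $\boldsymbol{\beta}$ (real single-index methods refine this, e.g. via average derivatives or profile least squares), then runs a one-dimensional Nadaraya-Watson smoother on the projected values $\boldsymbol{\beta}^T \mathbf{x}$.

```python
import numpy as np

np.random.seed(0)
d, n_train, n_test = 10, 500, 200

# Simulated single-index truth: f(x) = g(beta^T x) with g = sin
beta_true = np.ones(d) / np.sqrt(d)
g = np.sin
X_train = np.random.uniform(-2, 2, (n_train, d))
y_train = g(X_train @ beta_true) + np.random.normal(0, 0.2, n_train)
X_test = np.random.uniform(-2, 2, (n_test, d))
y_test_true = g(X_test @ beta_true)

# Step 1: crude index direction from OLS (an assumption of this sketch)
Xc = X_train - X_train.mean(axis=0)
yc = y_train - y_train.mean()
beta_hat, *_ = np.linalg.lstsq(Xc, yc, rcond=None)
beta_hat /= np.linalg.norm(beta_hat)

# Step 2: 1-D kernel smoothing along the estimated index
z_train = X_train @ beta_hat
z_test = X_test @ beta_hat
h = 1.06 * np.std(z_train) * n_train**(-0.2)   # 1-D rule-of-thumb bandwidth
w = np.exp(-((z_test[:, None] - z_train[None, :]) / h)**2 / 2)
y_pred = (w @ y_train) / w.sum(axis=1)

print(f"Single-index sketch, d={d}: test MSE = {np.mean((y_pred - y_test_true)**2):.4f}")
```

Even with a crude direction estimate, the smoothing step is effectively one-dimensional, so it inherits the favorable 1-D rate rather than the 10-D one.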
When is Nonparametric Regression Viable?

Based on the theoretical analysis and practical experience, here are guidelines for when pure nonparametric methods are appropriate:
| Dimension d | Sample Size Needed | Recommendation |
|---|---|---|
| 1-2 | 50-500 | Use nonparametric freely; LOESS, kernel smoothing all work well |
| 3-4 | 500-5,000 | Nonparametric viable; consider thin plate splines or tensor products |
| 5-8 | 5,000-100,000 | Use structured models (additive, low-rank); full nonparametric is marginal |
| 9-20 | 100,000+ | Additive models, single-index models, or parametric with basis expansion |
| 20+ | 1 million+ | Use regularized parametric models, neural nets, or tree ensembles; pure nonparametric infeasible |
Modern Approaches to High Dimensions:

For high-dimensional regression, modern machine learning has developed approaches that implicitly or explicitly manage the curse:

- Regularized linear models (Ridge, Lasso): assume near-linear relationships; work in arbitrary $d$
- Random forests / Gradient boosting: adaptive, handle interactions well, but still struggle with truly high $d$
- Neural networks: surprisingly good at finding low-dimensional structure; implicit regularization helps
- Gaussian processes with structured kernels: can encode additive/multiplicative structure
- Dimension reduction + nonparametric: PCA/PLS first, then nonparametric on the reduced dimensions (see the sketch below)
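As one concrete example of the last strategy, here is a minimal sketch under assumptions of my own choosing: a plain SVD-based PCA, a Gaussian Nadaraya-Watson smoother on the leading $k$ components, and synthetic data whose variation and signal genuinely live in a low-dimensional subspace (the situation where PCA-then-smooth can work).

```python
import numpy as np

np.random.seed(0)
d, k, n_train, n_test = 50, 2, 1000, 200

# Synthetic data: X is 50-D, but almost all of its variation (and the signal)
# lives in a k=2 dimensional latent subspace
B = np.random.normal(size=(k, d))
Z_lat_train = np.random.normal(size=(n_train, k))
Z_lat_test = np.random.normal(size=(n_test, k))
X_train = Z_lat_train @ B + 0.05 * np.random.normal(size=(n_train, d))
X_test = Z_lat_test @ B + 0.05 * np.random.normal(size=(n_test, d))
f = lambda Z: np.sin(Z[:, 0]) + 0.5 * Z[:, 1]**2
y_train = f(Z_lat_train) + np.random.normal(0, 0.2, n_train)
y_test_true = f(Z_lat_test)

# Step 1: PCA via SVD of the centered training design
mean = X_train.mean(axis=0)
_, _, Vt = np.linalg.svd(X_train - mean, full_matrices=False)
components = Vt[:k]                         # top-k principal directions

Z_train = (X_train - mean) @ components.T   # reduced coordinates
Z_test = (X_test - mean) @ components.T

# Step 2: Nadaraya-Watson kernel regression in the k-D reduced space
h = np.std(Z_train) * n_train ** (-1 / (4 + k))
dist2 = ((Z_test[:, None, :] - Z_train[None, :, :])**2).sum(axis=2)
W = np.exp(-dist2 / (2 * h**2))
y_pred = (W @ y_train) / W.sum(axis=1)

print(f"PCA({k}) + kernel smoother on d={d} inputs: "
      f"test MSE = {np.mean((y_pred - y_test_true)**2):.4f}")
```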
We've explored the fundamental challenge of high-dimensional nonparametric regression: local neighborhoods become global, the optimal convergence rate collapses from $n^{-4/5}$ in one dimension toward a nearly flat curve as $d$ grows, and only additional structure (additivity, sparsity, or low intrinsic dimension) restores tractability.
What's Next:

We've seen where nonparametric regression thrives and where it struggles. The final page pulls everything together with practical guidelines for when to use nonparametric methods—comparing them to parametric alternatives and providing decision frameworks for real-world applications.
You now understand why nonparametric regression struggles in high dimensions—and why this isn't a failure of algorithms but a fundamental property of the problem. This knowledge helps you choose appropriate methods: nonparametric for low-D exploration, structured methods for high-D inference. Next, we synthesize these insights into practical decision guidelines.