On the previous page, we studied kernel functions—the building blocks that determine the shape of local contributions in kernel density estimation. We saw that the choice of kernel has surprisingly little effect on estimation accuracy. So what does matter?
The answer is the bandwidth parameter $h$. The bandwidth controls the width of each kernel contribution, and its choice fundamentally determines the quality of the density estimate. But here's the critical insight: there is no universally 'best' bandwidth. Instead, we face a fundamental bias-variance tradeoff that governs all nonparametric estimation.
This tradeoff is not merely a technical detail—it is the theoretical foundation upon which all bandwidth selection methods are built. Understanding it deeply is essential for anyone who wants to do more than blindly apply default settings.
By the end of this page, you will understand:

- the mathematical decomposition of Mean Squared Error into bias and variance components,
- how bandwidth affects each component in opposing ways,
- the asymptotic expressions for bias and variance,
- how to combine these into Mean Integrated Squared Error (MISE),
- the optimal bandwidth derivation and its dependence on the unknown density, and
- the practical implications for KDE applications.
Let's begin by analyzing the error of our kernel density estimator at a single point $x$. Recall that given i.i.d. observations $X_1, \ldots, X_n$ from an unknown density $f$, the KDE is:
$$\hat{f}_h(x) = \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{x - X_i}{h}\right)$$
The natural measure of error at point $x$ is the Mean Squared Error (MSE):
$$\text{MSE}(\hat{f}_h(x)) = \mathbb{E}\left[(\hat{f}_h(x) - f(x))^2\right]$$
The Bias-Variance Decomposition:
This MSE can be decomposed into two fundamental components:
$$\text{MSE}(\hat{f}_h(x)) = \underbrace{\left(\mathbb{E}[\hat{f}_h(x)] - f(x)\right)^2}_{\text{Bias}^2} + \underbrace{\text{Var}(\hat{f}_h(x))}_{\text{Variance}}$$
This is the famous bias-variance decomposition. Let's understand each term.
Bias measures systematic error—how far the average estimate is from the truth. If you repeated the estimation infinitely many times with different samples, bias is the gap between the average of your estimates and the true value. Variance measures random error—how much individual estimates fluctuate around their average. High variance means estimates are unstable and sensitive to the particular sample.
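To make the decomposition concrete, here is a minimal Monte Carlo sketch (the standard normal target, Gaussian kernel, and all numeric settings are illustrative choices, not prescriptions): repeated samples yield an empirical bias, variance, and MSE at a single point, and the identity MSE = Bias² + Variance holds exactly for these sample moments.

```python
import numpy as np

rng = np.random.default_rng(0)

def kde_at(x, data, h):
    """Gaussian-kernel KDE evaluated at a single point x."""
    u = (x - data) / h
    return np.mean(np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)) / h

# True density: standard normal (an illustrative assumption)
f_true = lambda x: np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)

x0, n, h, reps = 0.0, 200, 0.5, 2000
estimates = np.array([kde_at(x0, rng.standard_normal(n), h)
                      for _ in range(reps)])

bias = estimates.mean() - f_true(x0)          # systematic error
variance = estimates.var()                    # random error
mse = np.mean((estimates - f_true(x0))**2)    # equals bias**2 + variance
```

Because $f''(0) < 0$ at the mode, the empirical bias comes out negative: the KDE flattens the peak.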
Deriving the Bias:
The expected value of the KDE at point $x$ is:
$$\mathbb{E}[\hat{f}_h(x)] = \frac{1}{h} \mathbb{E}\left[K\left(\frac{x - X_1}{h}\right)\right] = \frac{1}{h} \int K\left(\frac{x - t}{h}\right) f(t) \, dt$$
Using the substitution $u = (x - t)/h$:
$$\mathbb{E}[\hat{f}_h(x)] = \int K(u) f(x - hu) \, du$$
Now, we perform a Taylor expansion of $f(x - hu)$ around $x$:
$$f(x - hu) = f(x) - hu f'(x) + \frac{(hu)^2}{2} f''(x) - \frac{(hu)^3}{6} f'''(x) + O(h^4)$$
Substituting and using the properties of the kernel ($\int K(u) du = 1$, $\int u K(u) du = 0$ for symmetric kernels):
$$\mathbb{E}[\hat{f}_h(x)] = f(x) + \frac{h^2}{2} f''(x) \mu_2(K) + O(h^4)$$
where $\mu_2(K) = \int u^2 K(u) du$ is the second moment of the kernel.
Therefore, the bias is:
$$\text{Bias}(\hat{f}_h(x)) = \mathbb{E}[\hat{f}_h(x)] - f(x) = \frac{h^2}{2} f''(x) \mu_2(K) + O(h^4)$$
The bias is proportional to $f''(x)$—the curvature of the true density at point $x$. Where the density is highly curved (sharp peaks or valleys), the bias is large. Where the density is nearly linear (low curvature), the bias is small. This makes intuitive sense: averaging over a neighborhood introduces more error when the function is changing rapidly.
Deriving the Variance:
The KDE is an average of $n$ independent and identically distributed random variables:
$$\hat{f}_h(x) = \frac{1}{n} \sum_{i=1}^{n} \underbrace{\frac{1}{h} K\left(\frac{x - X_i}{h}\right)}_{Y_i}$$
For i.i.d. random variables, $\text{Var}(\bar{Y}) = \text{Var}(Y_1)/n$. So:
$$\text{Var}(\hat{f}_h(x)) = \frac{1}{n} \text{Var}\left(\frac{1}{h} K\left(\frac{x - X_1}{h}\right)\right)$$
Computing this variance:
$$\text{Var}(\hat{f}_h(x)) = \frac{1}{n} \left[\mathbb{E}\left(\frac{1}{h^2} K^2\left(\frac{x - X_1}{h}\right)\right) - \left(\mathbb{E}\left(\frac{1}{h} K\left(\frac{x - X_1}{h}\right)\right)\right)^2\right]$$
As $h \to 0$, the second (squared-mean) term stays bounded at $O(1)$, while the first term carries a factor of $1/h$ and dominates:
$$\mathbb{E}\left(\frac{1}{h^2} K^2\left(\frac{x - X_1}{h}\right)\right) = \frac{1}{h^2} \int K^2\left(\frac{x - t}{h}\right) f(t) \, dt = \frac{1}{h} \int K^2(u) f(x - hu) \, du$$
$$\approx \frac{f(x)}{h} \int K^2(u) \, du = \frac{f(x) R(K)}{h}$$
where $R(K) = \int K^2(u) du$ is the roughness of the kernel.
Therefore, the variance is:
$$\text{Var}(\hat{f}_h(x)) \approx \frac{f(x) R(K)}{nh}$$
Now we can see the core of the bias-variance tradeoff. Collecting our asymptotic expressions:
$$\text{Bias}^2(\hat{f}_h(x)) \approx \frac{h^4}{4} \left(f''(x)\right)^2 \mu_2(K)^2$$
$$\text{Var}(\hat{f}_h(x)) \approx \frac{f(x) R(K)}{nh}$$
The tension becomes crystal clear:
| Component | Dependence on $h$ | Effect of increasing $h$ |
|---|---|---|
| Bias² | $\propto h^4$ | Increases (overly smooth) |
| Variance | $\propto 1/(nh)$ | Decreases (more stable) |
This is the fundamental tradeoff:
You cannot simultaneously minimize both bias and variance with a finite sample. Any bandwidth choice is a compromise. The 'optimal' bandwidth minimizes the total error (MSE = Bias² + Variance), accepting that neither component is individually minimized. This is a fundamental limitation of nonparametric estimation, not a deficiency of KDE.
Visualizing the Tradeoff:
Imagine a U-shaped curve of MSE as a function of bandwidth:
The optimal bandwidth sits at the bottom of this U-curve, where the marginal cost of increasing bias equals the marginal benefit of decreasing variance.
Mathematical formulation:
$$\text{MSE}(\hat{f}_h(x)) \approx \frac{h^4}{4} \left(f''(x)\right)^2 \mu_2(K)^2 + \frac{f(x) R(K)}{nh}$$
To find the optimal $h$, we differentiate with respect to $h$ and set to zero:
$$\frac{d}{dh} \text{MSE} = h^3 \left(f''(x)\right)^2 \mu_2(K)^2 - \frac{f(x) R(K)}{nh^2} = 0$$
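The resulting optimum can be checked numerically. The sketch below (using illustrative values for a standard normal density at $x = 0$ and a Gaussian kernel) scans the asymptotic MSE over a grid of bandwidths and confirms that the U-curve minimum agrees with the closed-form solution $h^5 = f(x) R(K) / \big(n (f''(x))^2 \mu_2(K)^2\big)$ of the first-order condition.

```python
import numpy as np

# Gaussian kernel constants and illustrative pointwise values
R_K, mu2 = 1 / (2 * np.sqrt(np.pi)), 1.0    # R(K), mu_2(K)
fx, fpp, n = 0.3989, -0.3989, 500           # f(0), f''(0), sample size

def mse(h):
    """Asymptotic pointwise MSE: squared bias plus variance."""
    return (h**4 / 4) * fpp**2 * mu2**2 + fx * R_K / (n * h)

hs = np.linspace(0.05, 1.5, 20001)
h_grid = hs[np.argmin(mse(hs))]                        # U-curve minimum
h_closed = (fx * R_K / (n * fpp**2 * mu2**2)) ** 0.2   # solves dMSE/dh = 0
```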
While pointwise MSE is informative, we typically want a global measure of error over the entire density. The standard measure is the Mean Integrated Squared Error (MISE):
$$\text{MISE}(\hat{f}_h) = \mathbb{E}\left[\int \left(\hat{f}_h(x) - f(x)\right)^2 dx\right] = \int \text{MSE}(\hat{f}_h(x)) \, dx$$
Integrating our pointwise expressions over $x$:
$$\text{MISE}(\hat{f}_h) \approx \underbrace{\frac{h^4 \mu_2(K)^2}{4} \int \left(f''(x)\right)^2 dx}_{\text{Integrated Bias}^2} + \underbrace{\frac{R(K)}{nh} \int f(x) \, dx}_{\text{Integrated Variance}}$$
Since $\int f(x) dx = 1$ (normalization) and defining $R(f'') = \int (f''(x))^2 dx$ (roughness of the second derivative):
$$\text{MISE}(\hat{f}_h) \approx \frac{h^4 \mu_2(K)^2 R(f'')}{4} + \frac{R(K)}{nh}$$
This is the Asymptotic MISE (AMISE)—the leading-order approximation valid as $n \to \infty$ and $h \to 0$ with $nh \to \infty$.
The AMISE formula reveals what determines estimation difficulty:

- $R(f'')$, the roughness of the density's second derivative: densities with sharp features (high curvature) are harder to estimate because bias is larger for any given bandwidth; smooth, gently curved densities are easier.
- $R(K)$ and $\mu_2(K)$, properties of the kernel: these are fixed once you choose a kernel.
- $n$, the sample size: more data reduces variance but requires a smaller bandwidth to reduce bias.
Deriving the Optimal Bandwidth:
To minimize AMISE with respect to $h$, we differentiate and set to zero:
$$\frac{d}{dh} \text{AMISE}(h) = h^3 \mu_2(K)^2 R(f'') - \frac{R(K)}{nh^2} = 0$$
Solving for $h$:
$$h^5 = \frac{R(K)}{n \mu_2(K)^2 R(f'')}$$
$$h_{\text{opt}} = \left(\frac{R(K)}{\mu_2(K)^2 R(f'')}\right)^{1/5} n^{-1/5}$$
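As a quick sanity check, plugging in the Gaussian kernel constants together with the standard normal reference value $R(f'') = 3/(8\sqrt{\pi})$ (used here purely for illustration, since $R(f'')$ is unknown in practice) recovers the familiar constant $(4/3)^{1/5} \approx 1.0592$:

```python
import numpy as np

def h_amise(n, R_K, mu2, R_fpp):
    """AMISE-optimal bandwidth; requires the (unknown) roughness R(f'')."""
    return (R_K / (mu2**2 * R_fpp)) ** 0.2 * n ** -0.2

R_K, mu2 = 1 / (2 * np.sqrt(np.pi)), 1.0   # Gaussian kernel constants
R_fpp = 3 / (8 * np.sqrt(np.pi))           # standard normal reference value
h = h_amise(1000, R_K, mu2, R_fpp)         # (4/3)**(1/5) * 1000**(-1/5)
```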
Key observations about the optimal bandwidth:
Scales as $n^{-1/5}$: This is slower than the typical $n^{-1/2}$ rate in parametric statistics. More data helps, but with diminishing returns.
Depends on the unknown $R(f'')$: The optimal bandwidth requires knowing a property of the very density we're trying to estimate! This is the fundamental challenge of bandwidth selection.
Depends on the kernel through $R(K)$ and $\mu_2(K)$: different kernels have different optimal bandwidths, but these differ only by a fixed scale factor, so a bandwidth chosen for one kernel can be converted to another.
| Kernel | $R(K)$ | $\mu_2(K)$ | Constant $C_K$ |
|---|---|---|---|
| Gaussian | 0.2821 | 1.0000 | 1.0592 |
| Epanechnikov | 0.6000 | 0.2000 | 2.3449 |
| Biweight | 0.7143 | 0.1429 | 2.7779 |
| Triweight | 0.8159 | 0.1111 | 3.1546 |
| Uniform | 0.5000 | 0.3333 | 1.8431 |
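Assuming the table's constant $C_K$ denotes the normal-reference bandwidth factor $C_K = \big(R(K)/(\mu_2(K)^2 R(\varphi''))\big)^{1/5}$ with $R(\varphi'') = 3/(8\sqrt{\pi})$ for the standard normal density, the entries can be recomputed from the exact kernel moments:

```python
import numpy as np

# Exact (R(K), mu_2(K)) pairs for the kernels in the table
kernels = {
    "gaussian":     (1 / (2 * np.sqrt(np.pi)), 1.0),
    "epanechnikov": (3 / 5, 1 / 5),
    "biweight":     (5 / 7, 1 / 7),
    "triweight":    (350 / 429, 1 / 9),
    "uniform":      (1 / 2, 1 / 3),
}
R_phi2 = 3 / (8 * np.sqrt(np.pi))   # R(f'') for the standard normal
C = {k: (R / (m2**2 * R_phi2)) ** 0.2 for k, (R, m2) in kernels.items()}
```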
Substituting the optimal bandwidth back into the AMISE expression gives us the optimal achievable error. This reveals the fundamental rate at which KDE can estimate densities.
Substituting $h_{\text{opt}}$ into AMISE:
Let $h_{\text{opt}} = C \cdot n^{-1/5}$ where $C = \left(\frac{R(K)}{\mu_2(K)^2 R(f'')}\right)^{1/5}$.
Then:
$$\text{AMISE}(h_{\text{opt}}) = \frac{(C n^{-1/5})^4 \mu_2(K)^2 R(f'')}{4} + \frac{R(K)}{n \cdot C n^{-1/5}}$$
$$= \frac{C^4 \mu_2(K)^2 R(f'')}{4} n^{-4/5} + \frac{R(K)}{C} n^{-4/5}$$
After simplification (at the optimum, the integrated variance is exactly four times the integrated squared bias):
$$\text{AMISE}(h_{\text{opt}}) = \frac{5}{4} \left(\mu_2(K)^2 R(K)^4 R(f'')\right)^{1/5} n^{-4/5}$$
The fundamental result: The optimal MISE decreases at rate $n^{-4/5}$.
The $n^{-4/5}$ rate is fundamentally slower than the $n^{-1}$ rate achievable by parametric estimators when the model is correctly specified. This is the price of nonparametric flexibility—we make no assumptions about the form of the density, but convergence is slower. To achieve the same error as a parametric estimator with $n$ observations, we need roughly $n^{5/4}$ observations.
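A mechanical check of this algebra: the sketch below plugs $h_{\text{opt}}$ back into AMISE (Gaussian kernel and normal reference roughness as stand-in values) and verifies that the total equals $\tfrac{5}{4}\big(\mu_2(K)^2 R(K)^4 R(f'')\big)^{1/5} n^{-4/5}$, with the variance term exactly four times the squared-bias term at the optimum.

```python
import numpy as np

R_K, mu2 = 1 / (2 * np.sqrt(np.pi)), 1.0   # Gaussian kernel constants
R_fpp = 3 / (8 * np.sqrt(np.pi))           # illustrative R(f'')
n = 1000

bias2 = lambda h: h**4 * mu2**2 * R_fpp / 4   # integrated squared bias
var   = lambda h: R_K / (n * h)               # integrated variance

h_opt = (R_K / (mu2**2 * R_fpp * n)) ** 0.2
closed = 1.25 * (mu2**2 * R_K**4 * R_fpp) ** 0.2 * n ** -0.8
# at h_opt: var == 4 * bias2, and bias2 + var == closed
```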
Rate Comparison:
| Estimation Method | Optimal Rate | Required $n$ for error $\epsilon$ |
|---|---|---|
| Parametric (correct model) | $O(n^{-1})$ | $O(1/\epsilon)$ |
| KDE (univariate) | $O(n^{-4/5})$ | $O(1/\epsilon^{5/4})$ |
| KDE (bivariate) | $O(n^{-4/6})$ | $O(1/\epsilon^{6/4})$ |
| KDE ($d$-dimensional) | $O(n^{-4/(4+d)})$ | $O(1/\epsilon^{(4+d)/4})$ |
Interpreting the rate:
Why $n^{-4/5}$?
The exponent $-4/5$ arises from balancing bias (which decays as $h^2 \sim n^{-2/5}$) against variance (which decays as $(nh)^{-1} \sim n^{-4/5}$). The optimal balance gives bias² and variance both proportional to $n^{-4/5}$.
Understanding how sample size affects the bias-variance tradeoff provides crucial intuition for practitioners.
As $n$ increases:
Optimal bandwidth decreases: $h_{\text{opt}} \propto n^{-1/5}$. With more data, we can afford narrower kernels without excessive variance.
Variance decreases faster than bias: At the optimal $h$, both bias² and variance scale as $n^{-4/5}$, but if we fix $h$, variance decreases as $n^{-1}$ while bias remains constant.
More features become resolvable: Finer details in the density (smaller bumps, sharper peaks) become detectable as we can use smaller bandwidths.
The U-curve shifts: The minimum of the bias-variance curve moves left (smaller $h$) and down (lower total error).
Practical implications:
| Sample Size $n$ | Relative $h_{\text{opt}}$ | Relative MISE |
|---|---|---|
| 50 | 1.00 | 1.00 |
| 100 | 0.87 | 0.57 |
| 500 | 0.63 | 0.16 |
| 1,000 | 0.55 | 0.09 |
| 5,000 | 0.40 | 0.03 |
| 10,000 | 0.35 | 0.01 |
| 100,000 | 0.22 | 0.002 |
Doubling the sample size reduces the optimal bandwidth by about 13% (factor of $2^{-1/5} \approx 0.87$) and reduces the MISE by about 43% (factor of $2^{-4/5} \approx 0.57$). Keep this in mind when deciding how much data to collect for density estimation tasks.
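These ratios follow directly from the two scaling laws $h_{\text{opt}} \propto n^{-1/5}$ and $\text{MISE} \propto n^{-4/5}$; a two-line sketch with baseline $n = 50$ reproduces them:

```python
ns = [50, 100, 500, 1000, 5000, 10000, 100000]
rel_h    = [(50 / n) ** (1 / 5) for n in ns]   # bandwidth scaling, n^(-1/5)
rel_mise = [(50 / n) ** (4 / 5) for n in ns]   # MISE scaling, n^(-4/5)
```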
The roughness of the true density, $R(f'') = \int (f''(x))^2 dx$, plays a critical role in determining how well we can estimate it. This quantity captures how 'wiggly' the density is.
Smooth densities (low $R(f'')$): gentle curvature throughout, so bias stays small even at generous bandwidths; accurate estimates are possible with moderate samples.
Rough densities (high $R(f'')$): sharp peaks and valleys force a small bandwidth to control bias, which in turn demands much more data to control variance.
Reference values of $R(f'')$:
| Distribution | $R(f'')$ | Relative Difficulty |
|---|---|---|
| Normal(0, 1) | $3/(8\sqrt{\pi}) \approx 0.2116$ | Easy (baseline) |
| Uniform(-1, 1) | $0$ (boundary issues) | Very easy in interior |
| Laplace(0, 1) | $1/4 = 0.25$ | Easy |
| Mixture of 2 close Normals | $\sim 0.5 - 2$ | Moderate |
| Mixture of 5 Normals | $\sim 1 - 5$ | Difficult |
| Claw density | $\sim 4.2$ | Very difficult |
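The baseline entry can be reproduced numerically. For the standard normal, $f''(x) = (x^2 - 1)\varphi(x)$, and a simple quadrature of $\int (f''(x))^2 \, dx$ recovers the exact value $3/(8\sqrt{\pi}) \approx 0.2116$:

```python
import numpy as np

# f''(x) = (x^2 - 1) * phi(x) for the standard normal density phi
x = np.linspace(-10.0, 10.0, 200001)
phi = np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)
fpp = (x**2 - 1) * phi

dx = x[1] - x[0]
R_fpp = np.sum(fpp**2) * dx        # simple quadrature; tails are negligible
exact = 3 / (8 * np.sqrt(np.pi))   # ~ 0.2116
```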
Here we encounter a fundamental challenge: to choose the optimal bandwidth, we need to know $R(f'')$—but $f$ is exactly what we're trying to estimate! This circularity is why bandwidth selection is non-trivial and has spawned an entire sub-literature of methods (plug-in, cross-validation, etc.) that we'll explore in the next section.
Visual intuition:
Think of density estimation as fitting a flexible sheet to a surface:
Smooth density (low $R(f'')$): Like fitting a sheet to gently rolling hills. The sheet doesn't need to bend sharply, so even if we use a somewhat stiff sheet (large bandwidth), we get a good fit.
Rough density (high $R(f'')$): Like fitting a sheet to a mountainous terrain with sharp peaks and valleys. We need a very flexible sheet (small bandwidth) to capture the features, but this flexibility makes the sheet more susceptible to noise (data scatter).
The sample size determines how much noise we have to contend with. More data means less noise, allowing us to use a more flexible sheet without being overwhelmed by random fluctuations.
So far, we've analyzed the global tradeoff (MISE over the entire domain). However, the bias-variance balance can vary significantly across the domain of the density.
Regional variation:
Recall the pointwise expressions:
$$\text{Bias}^2(x) \propto (f''(x))^2$$ $$\text{Var}(x) \propto f(x)$$
This means:
In high-density regions: Variance is larger (more data, but also more variability), bias depends on local curvature.
In low-density regions (tails): Variance is smaller in absolute terms but potentially larger relative to the true density. Bias can be significant if the density is changing shape.
Near modes: Often high curvature ($f''$ large) means higher bias; also high density means higher variance.
In valleys between modes: Curvature can be high (concave up), causing bias; density is low, so variance is lower.
The inadequacy of global bandwidth:
A single global bandwidth cannot optimally balance the tradeoff everywhere. The AMISE-optimal bandwidth is a compromise: it may undersmooth in some regions and oversmooth in others.
This observation motivates variable bandwidth methods (covered in a later section) where $h$ adapts to local conditions.
In the tails of the distribution, both problems can occur simultaneously: (1) Few data points means high relative variance, and (2) The density may have high curvature as it decays. Global bandwidth methods often perform poorly in tails, either producing spurious bumps (undersmoothing) or missing the true tail decay rate (oversmoothing).
Integrated vs. pointwise optimization:
| Approach | Optimizes | Good for |
|---|---|---|
| Global MISE-optimal $h$ | Overall average performance | General-purpose estimation |
| Local pointwise optimal $h(x)$ | Performance at specific location | Understanding local behavior |
| Variable bandwidth $h(x_i)$ | Adapts to local data density | Densities with varying complexity |
| Adaptive estimation | Compromise between local adaptation and stability | Difficult densities |
In practice, most applications use a global bandwidth for simplicity, but awareness of local tradeoffs helps interpret estimates critically—especially in regions of high curvature or low density.
The theoretical analysis of bias-variance tradeoff has immediate practical implications for how we approach density estimation.
Key practical lessons:
Don't expect perfection: The nonparametric rate $n^{-4/5}$ is fundamental; no bandwidth selection method can do better asymptotically. Accept that estimates are approximations.
Sample size matters a lot: Because of the $n^{-1/5}$ scaling, you need substantial sample sizes for accurate estimation. With $n = 50$, expect noticeable errors; with $n = 5000$, expect reasonably accurate estimates.
Undersmoothing is often worse than oversmoothing: Excessive variance (undersmoothing) produces misleading spurious features; excessive bias (oversmoothing) may miss features but is less misleading. When in doubt, err slightly on the side of smoothing.
Visual inspection is essential: Plot the estimate with several bandwidth values. If conclusions change dramatically with bandwidth, they're not robust and more data is needed.
Context matters for bandwidth choice: The 'optimal' bandwidth depends on what question you're answering. For finding modes, smaller bandwidths may be better. For overall shape, larger bandwidths may suffice.
We've established the theoretical foundation of the bias-variance tradeoff. In the next section, we'll dive into the practical methods for actually selecting the bandwidth—turning this theory into actionable procedures that work with real data.
The bias-variance tradeoff is the theoretical bedrock upon which all bandwidth selection methods are built. Let's consolidate the key insights:
You now understand why bandwidth selection is challenging—the optimal bandwidth depends on the unknown density we're trying to estimate. The next page will explore the practical methods for solving this problem: rules of thumb, plug-in estimators, and cross-validation approaches that have become the standard tools for KDE practitioners.