On the previous page, we studied kernel functions—the building blocks that determine the shape of local contributions in kernel density estimation. We saw that the choice of kernel has surprisingly little effect on estimation accuracy. So what does matter?
The answer is the bandwidth parameter $h$. The bandwidth controls the width of each kernel contribution, and its choice fundamentally determines the quality of the density estimate. But here's the critical insight: there is no universally 'best' bandwidth. Instead, we face a fundamental bias-variance tradeoff that governs all nonparametric estimation.
This tradeoff is not merely a technical detail—it is the theoretical foundation upon which all bandwidth selection methods are built. Understanding it deeply is essential for anyone who wants to do more than blindly apply default settings.
By the end of this page, you will understand:

- the mathematical decomposition of Mean Squared Error into bias and variance components,
- how bandwidth affects each component in opposing ways,
- the asymptotic expressions for bias and variance,
- how to combine these into Mean Integrated Squared Error (MISE),
- the optimal bandwidth derivation and its dependence on the unknown density, and
- the practical implications for KDE applications.
Let's begin by analyzing the error of our kernel density estimator at a single point $x$. Recall that given i.i.d. observations $X_1, \ldots, X_n$ from an unknown density $f$, the KDE is:
$$\hat{f}_h(x) = \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{x - X_i}{h}\right)$$
The natural measure of error at point $x$ is the Mean Squared Error (MSE):
$$\text{MSE}(\hat{f}_h(x)) = \mathbb{E}\left[(\hat{f}_h(x) - f(x))^2\right]$$
The Bias-Variance Decomposition:
This MSE can be decomposed into two fundamental components:
$$\text{MSE}(\hat{f}_h(x)) = \underbrace{\left(\mathbb{E}[\hat{f}_h(x)] - f(x)\right)^2}_{\text{Bias}^2} + \underbrace{\text{Var}(\hat{f}_h(x))}_{\text{Variance}}$$
This is the famous bias-variance decomposition. Let's understand each term.
Bias measures systematic error—how far the average estimate is from the truth. If you repeated the estimation infinitely many times with different samples, bias is the gap between the average of your estimates and the true value. Variance measures random error—how much individual estimates fluctuate around their average. High variance means estimates are unstable and sensitive to the particular sample.
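To make the decomposition concrete, here is a minimal Monte Carlo sketch (the standard normal target, Gaussian kernel, and all numeric settings are illustrative choices, not prescriptions): repeated samples yield an empirical bias, variance, and MSE at a single point, and the identity MSE = Bias² + Variance holds exactly for these sample moments.

```python
import numpy as np

rng = np.random.default_rng(0)

def kde_at(x, data, h):
    """Gaussian-kernel KDE evaluated at a single point x."""
    u = (x - data) / h
    return np.mean(np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)) / h

# True density: standard normal (an illustrative assumption)
f_true = lambda x: np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)

x0, n, h, reps = 0.0, 200, 0.5, 2000
estimates = np.array([kde_at(x0, rng.standard_normal(n), h)
                      for _ in range(reps)])

bias = estimates.mean() - f_true(x0)          # systematic error
variance = estimates.var()                    # random error
mse = np.mean((estimates - f_true(x0))**2)    # equals bias**2 + variance
```

Because $f''(0) < 0$ at the mode, the empirical bias comes out negative: the KDE flattens the peak.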
Deriving the Bias:
The expected value of the KDE at point $x$ is:
$$\mathbb{E}[\hat{f}_h(x)] = \frac{1}{h} \mathbb{E}\left[K\left(\frac{x - X_1}{h}\right)\right] = \frac{1}{h} \int K\left(\frac{x - t}{h}\right) f(t) \, dt$$
Using the substitution $u = (x - t)/h$:
$$\mathbb{E}[\hat{f}_h(x)] = \int K(u) f(x - hu) \, du$$
Now, we perform a Taylor expansion of $f(x - hu)$ around $x$:
$$f(x - hu) = f(x) - hu f'(x) + \frac{(hu)^2}{2} f''(x) - \frac{(hu)^3}{6} f'''(x) + O(h^4)$$
Substituting and using the properties of the kernel ($\int K(u) du = 1$, $\int u K(u) du = 0$ for symmetric kernels):
$$\mathbb{E}[\hat{f}_h(x)] = f(x) + \frac{h^2}{2} f''(x) \mu_2(K) + O(h^4)$$
where $\mu_2(K) = \int u^2 K(u) du$ is the second moment of the kernel.
Therefore, the bias is:
$$\text{Bias}(\hat{f}_h(x)) = \mathbb{E}[\hat{f}_h(x)] - f(x) = \frac{h^2}{2} f''(x) \mu_2(K) + O(h^4)$$
The bias is proportional to $f''(x)$—the curvature of the true density at point $x$. Where the density is highly curved (sharp peaks or valleys), the bias is large. Where the density is nearly linear (low curvature), the bias is small. This makes intuitive sense: averaging over a neighborhood introduces more error when the function is changing rapidly.
Deriving the Variance:
The KDE is an average of $n$ independent and identically distributed random variables:
$$\hat{f}_h(x) = \frac{1}{n} \sum_{i=1}^{n} \underbrace{\frac{1}{h} K\left(\frac{x - X_i}{h}\right)}_{Y_i}$$
For i.i.d. random variables, $\text{Var}(\bar{Y}) = \text{Var}(Y_1)/n$. So:
$$\text{Var}(\hat{f}_h(x)) = \frac{1}{n} \text{Var}\left(\frac{1}{h} K\left(\frac{x - X_1}{h}\right)\right)$$
Computing this variance:
$$\text{Var}(\hat{f}_h(x)) = \frac{1}{n} \left[\mathbb{E}\left(\frac{1}{h^2} K^2\left(\frac{x - X_1}{h}\right)\right) - \left(\mathbb{E}\left(\frac{1}{h} K\left(\frac{x - X_1}{h}\right)\right)\right)^2\right]$$
As $h \to 0$, the second (squared-mean) term stays bounded at $O(1)$, while the first term carries a factor of $1/h$ and dominates:
$$\mathbb{E}\left(\frac{1}{h^2} K^2\left(\frac{x - X_1}{h}\right)\right) = \frac{1}{h^2} \int K^2\left(\frac{x - t}{h}\right) f(t) \, dt = \frac{1}{h} \int K^2(u) f(x - hu) \, du$$
$$\approx \frac{f(x)}{h} \int K^2(u) \, du = \frac{f(x) R(K)}{h}$$
where $R(K) = \int K^2(u) du$ is the roughness of the kernel.
Therefore, the variance is:
$$\text{Var}(\hat{f}_h(x)) \approx \frac{f(x) R(K)}{nh}$$
Now we can see the core of the bias-variance tradeoff. Collecting our asymptotic expressions:
$$\text{Bias}^2(\hat{f}_h(x)) \approx \frac{h^4}{4} \left(f''(x)\right)^2 \mu_2(K)^2$$
$$\text{Var}(\hat{f}_h(x)) \approx \frac{f(x) R(K)}{nh}$$
The tension becomes crystal clear:
| Component | Dependence on $h$ | Effect of increasing $h$ |
|---|---|---|
| Bias² | $\propto h^4$ | Increases (overly smooth) |
| Variance | $\propto 1/(nh)$ | Decreases (more stable) |
This is the fundamental tradeoff:
You cannot simultaneously minimize both bias and variance with a finite sample. Any bandwidth choice is a compromise. The 'optimal' bandwidth minimizes the total error (MSE = Bias² + Variance), accepting that neither component is individually minimized. This is a fundamental limitation of nonparametric estimation, not a deficiency of KDE.
Visualizing the Tradeoff:
Imagine a U-shaped curve of MSE as a function of bandwidth:
The optimal bandwidth sits at the bottom of this U-curve, where the marginal cost of increasing bias equals the marginal benefit of decreasing variance.
Mathematical formulation:
$$\text{MSE}(\hat{f}_h(x)) \approx \frac{h^4}{4} \left(f''(x)\right)^2 \mu_2(K)^2 + \frac{f(x) R(K)}{nh}$$
To find the optimal $h$, we differentiate with respect to $h$ and set to zero:
$$\frac{d}{dh} \text{MSE} = h^3 \left(f''(x)\right)^2 \mu_2(K)^2 - \frac{f(x) R(K)}{nh^2} = 0$$
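The resulting optimum can be checked numerically. The sketch below (using illustrative values for a standard normal density at $x = 0$ and a Gaussian kernel) scans the asymptotic MSE over a grid of bandwidths and confirms that the U-curve minimum agrees with the closed-form solution $h^5 = f(x) R(K) / \big(n (f''(x))^2 \mu_2(K)^2\big)$ of the first-order condition.

```python
import numpy as np

# Gaussian kernel constants and illustrative pointwise values
R_K, mu2 = 1 / (2 * np.sqrt(np.pi)), 1.0    # R(K), mu_2(K)
fx, fpp, n = 0.3989, -0.3989, 500           # f(0), f''(0), sample size

def mse(h):
    """Asymptotic pointwise MSE: squared bias plus variance."""
    return (h**4 / 4) * fpp**2 * mu2**2 + fx * R_K / (n * h)

hs = np.linspace(0.05, 1.5, 20001)
h_grid = hs[np.argmin(mse(hs))]                        # U-curve minimum
h_closed = (fx * R_K / (n * fpp**2 * mu2**2)) ** 0.2   # solves dMSE/dh = 0
```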
While pointwise MSE is informative, we typically want a global measure of error over the entire density. The standard measure is the Mean Integrated Squared Error (MISE):
$$\text{MISE}(\hat{f}_h) = \mathbb{E}\left[\int \left(\hat{f}_h(x) - f(x)\right)^2 dx\right] = \int \text{MSE}(\hat{f}_h(x)) \, dx$$
Integrating our pointwise expressions over $x$:
$$\text{MISE}(\hat{f}_h) \approx \underbrace{\frac{h^4 \mu_2(K)^2}{4} \int \left(f''(x)\right)^2 dx}_{\text{Integrated Bias}^2} + \underbrace{\frac{R(K)}{nh} \int f(x) \, dx}_{\text{Integrated Variance}}$$
Since $\int f(x) dx = 1$ (normalization) and defining $R(f'') = \int (f''(x))^2 dx$ (roughness of the second derivative):
$$\text{MISE}(\hat{f}_h) \approx \frac{h^4 \mu_2(K)^2 R(f'')}{4} + \frac{R(K)}{nh}$$
This is the Asymptotic MISE (AMISE)—the leading-order approximation valid as $n \to \infty$ and $h \to 0$ with $nh \to \infty$.
The AMISE formula reveals what determines estimation difficulty:

- $R(f'')$, the roughness of the density's second derivative: densities with sharp features (high curvature) are harder to estimate because bias is larger for any given bandwidth; smooth, gently curved densities are easier.
- $R(K)$ and $\mu_2(K)$, properties of the kernel: these are fixed once you choose a kernel.
- $n$, the sample size: more data reduces variance but requires a smaller bandwidth to reduce bias.
Deriving the Optimal Bandwidth:
To minimize AMISE with respect to $h$, we differentiate and set to zero:
$$\frac{d}{dh} \text{AMISE}(h) = h^3 \mu_2(K)^2 R(f'') - \frac{R(K)}{nh^2} = 0$$
Solving for $h$:
$$h^5 = \frac{R(K)}{n \mu_2(K)^2 R(f'')}$$
$$h_{\text{opt}} = \left(\frac{R(K)}{\mu_2(K)^2 R(f'')}\right)^{1/5} n^{-1/5}$$
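As a quick sanity check, plugging in the Gaussian kernel constants together with the standard normal reference value $R(f'') = 3/(8\sqrt{\pi})$ (used here purely for illustration, since $R(f'')$ is unknown in practice) recovers the familiar constant $(4/3)^{1/5} \approx 1.0592$:

```python
import numpy as np

def h_amise(n, R_K, mu2, R_fpp):
    """AMISE-optimal bandwidth; requires the (unknown) roughness R(f'')."""
    return (R_K / (mu2**2 * R_fpp)) ** 0.2 * n ** -0.2

R_K, mu2 = 1 / (2 * np.sqrt(np.pi)), 1.0   # Gaussian kernel constants
R_fpp = 3 / (8 * np.sqrt(np.pi))           # standard normal reference value
h = h_amise(1000, R_K, mu2, R_fpp)         # (4/3)**(1/5) * 1000**(-1/5)
```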
Key observations about the optimal bandwidth:
Scales as $n^{-1/5}$: This is slower than the typical $n^{-1/2}$ rate in parametric statistics. More data helps, but with diminishing returns.
Depends on the unknown $R(f'')$: The optimal bandwidth requires knowing a property of the very density we're trying to estimate! This is the fundamental challenge of bandwidth selection.
Depends on the kernel through $R(K)$ and $\mu_2(K)$: different kernels have different optimal bandwidths, but these differ only by a fixed scale factor, so a bandwidth chosen for one kernel can be converted to another.
| Kernel | $R(K)$ | $\mu_2(K)$ | Constant $C_K$ |
|---|---|---|---|
| Gaussian | 0.2821 | 1.0000 | 1.0592 |
| Epanechnikov | 0.6000 | 0.2000 | 2.3449 |
| Biweight | 0.7143 | 0.1429 | 2.7779 |
| Triweight | 0.8159 | 0.1111 | 3.1546 |
| Uniform | 0.5000 | 0.3333 | 1.8431 |
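Assuming the table's constant $C_K$ denotes the normal-reference bandwidth factor $C_K = \big(R(K)/(\mu_2(K)^2 R(\varphi''))\big)^{1/5}$ with $R(\varphi'') = 3/(8\sqrt{\pi})$ for the standard normal density, the entries can be recomputed from the exact kernel moments:

```python
import numpy as np

# Exact (R(K), mu_2(K)) pairs for the kernels in the table
kernels = {
    "gaussian":     (1 / (2 * np.sqrt(np.pi)), 1.0),
    "epanechnikov": (3 / 5, 1 / 5),
    "biweight":     (5 / 7, 1 / 7),
    "triweight":    (350 / 429, 1 / 9),
    "uniform":      (1 / 2, 1 / 3),
}
R_phi2 = 3 / (8 * np.sqrt(np.pi))   # R(f'') for the standard normal
C = {k: (R / (m2**2 * R_phi2)) ** 0.2 for k, (R, m2) in kernels.items()}
```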
Substituting the optimal bandwidth back into the AMISE expression gives us the optimal achievable error. This reveals the fundamental rate at which KDE can estimate densities.
Substituting $h_{\text{opt}}$ into AMISE:
Let $h_{\text{opt}} = C \cdot n^{-1/5}$ where $C = \left(\frac{R(K)}{\mu_2(K)^2 R(f'')}\right)^{1/5}$.
Then:
$$\text{AMISE}(h_{\text{opt}}) = \frac{(C n^{-1/5})^4 \mu_2(K)^2 R(f'')}{4} + \frac{R(K)}{n \cdot C n^{-1/5}}$$
$$= \frac{C^4 \mu_2(K)^2 R(f'')}{4} n^{-4/5} + \frac{R(K)}{C} n^{-4/5}$$
After simplification (at the optimum, the integrated variance is exactly four times the integrated squared bias):
$$\text{AMISE}(h_{\text{opt}}) = \frac{5}{4} \left(\mu_2(K)^2 R(K)^4 R(f'')\right)^{1/5} n^{-4/5}$$
The fundamental result: The optimal MISE decreases at rate $n^{-4/5}$.
The $n^{-4/5}$ rate is fundamentally slower than the $n^{-1}$ rate achievable by parametric estimators when the model is correctly specified. This is the price of nonparametric flexibility—we make no assumptions about the form of the density, but convergence is slower. To achieve the same error as a parametric estimator with $n$ observations, we need roughly $n^{5/4}$ observations.
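A mechanical check of this algebra: the sketch below plugs $h_{\text{opt}}$ back into AMISE (Gaussian kernel and normal reference roughness as stand-in values) and verifies that the total equals $\tfrac{5}{4}\big(\mu_2(K)^2 R(K)^4 R(f'')\big)^{1/5} n^{-4/5}$, with the variance term exactly four times the squared-bias term at the optimum.

```python
import numpy as np

R_K, mu2 = 1 / (2 * np.sqrt(np.pi)), 1.0   # Gaussian kernel constants
R_fpp = 3 / (8 * np.sqrt(np.pi))           # illustrative R(f'')
n = 1000

bias2 = lambda h: h**4 * mu2**2 * R_fpp / 4   # integrated squared bias
var   = lambda h: R_K / (n * h)               # integrated variance

h_opt = (R_K / (mu2**2 * R_fpp * n)) ** 0.2
closed = 1.25 * (mu2**2 * R_K**4 * R_fpp) ** 0.2 * n ** -0.8
# at h_opt: var == 4 * bias2, and bias2 + var == closed
```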
Rate Comparison:
| Estimation Method | Optimal Rate | Required $n$ for error $\epsilon$ |
|---|---|---|
| Parametric (correct model) | $O(n^{-1})$ | $O(1/\epsilon)$ |
| KDE (univariate) | $O(n^{-4/5})$ | $O(1/\epsilon^{5/4})$ |
| KDE (bivariate) | $O(n^{-4/6})$ | $O(1/\epsilon^{6/4})$ |
| KDE ($d$-dimensional) | $O(n^{-4/(4+d)})$ | $O(1/\epsilon^{(4+d)/4})$ |
Interpreting the rate:
Why $n^{-4/5}$?
The exponent $-4/5$ arises from balancing bias (which decays as $h^2 \sim n^{-2/5}$) against variance (which decays as $(nh)^{-1} \sim n^{-4/5}$). The optimal balance gives bias² and variance both proportional to $n^{-4/5}$.
Understanding how sample size affects the bias-variance tradeoff provides crucial intuition for practitioners.
As $n$ increases:
Optimal bandwidth decreases: $h_{\text{opt}} \propto n^{-1/5}$. With more data, we can afford narrower kernels without excessive variance.
Variance decreases faster than bias: At the optimal $h$, both bias² and variance scale as $n^{-4/5}$, but if we fix $h$, variance decreases as $n^{-1}$ while bias remains constant.
More features become resolvable: Finer details in the density (smaller bumps, sharper peaks) become detectable as we can use smaller bandwidths.
The U-curve shifts: The minimum of the bias-variance curve moves left (smaller $h$) and down (lower total error).
Practical implications:
| Sample Size $n$ | Relative $h_{\text{opt}}$ | Relative MISE |
|---|---|---|
| 50 | 1.00 | 1.00 |
| 100 | 0.87 | 0.57 |
| 500 | 0.63 | 0.16 |
| 1,000 | 0.55 | 0.09 |
| 5,000 | 0.40 | 0.03 |
| 10,000 | 0.35 | 0.01 |
| 100,000 | 0.22 | 0.002 |
Doubling the sample size reduces the optimal bandwidth by about 13% (factor of $2^{-1/5} \approx 0.87$) and reduces the MISE by about 43% (factor of $2^{-4/5} \approx 0.57$). Keep this in mind when deciding how much data to collect for density estimation tasks.
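These ratios follow directly from the two scaling laws $h_{\text{opt}} \propto n^{-1/5}$ and $\text{MISE} \propto n^{-4/5}$; a two-line sketch with baseline $n = 50$ reproduces them:

```python
ns = [50, 100, 500, 1000, 5000, 10000, 100000]
rel_h    = [(50 / n) ** (1 / 5) for n in ns]   # bandwidth scaling, n^(-1/5)
rel_mise = [(50 / n) ** (4 / 5) for n in ns]   # MISE scaling, n^(-4/5)
```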
The roughness of the true density, $R(f'') = \int (f''(x))^2 dx$, plays a critical role in determining how well we can estimate it. This quantity captures how 'wiggly' the density is.
Smooth densities (low $R(f'')$): gentle curvature throughout, so bias stays small even at generous bandwidths; accurate estimates are possible with moderate samples.
Rough densities (high $R(f'')$): sharp peaks and valleys force a small bandwidth to control bias, which in turn demands much more data to control variance.
Reference values of $R(f'')$:
| Distribution | $R(f'')$ | Relative Difficulty |
|---|---|---|
| Normal(0, 1) | $3/(8\sqrt{\pi}) \approx 0.2116$ | Easy (baseline) |
| Uniform(-1, 1) | $0$ (boundary issues) | Very easy in interior |
| Laplace(0, 1) | $1/4 = 0.25$ | Easy |
| Mixture of 2 close Normals | $\sim 0.5 - 2$ | Moderate |
| Mixture of 5 Normals | $\sim 1 - 5$ | Difficult |
| Claw density | $\sim 4.2$ | Very difficult |
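The baseline entry can be reproduced numerically. For the standard normal, $f''(x) = (x^2 - 1)\varphi(x)$, and a simple quadrature of $\int (f''(x))^2 \, dx$ recovers the exact value $3/(8\sqrt{\pi}) \approx 0.2116$:

```python
import numpy as np

# f''(x) = (x^2 - 1) * phi(x) for the standard normal density phi
x = np.linspace(-10.0, 10.0, 200001)
phi = np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)
fpp = (x**2 - 1) * phi

dx = x[1] - x[0]
R_fpp = np.sum(fpp**2) * dx        # simple quadrature; tails are negligible
exact = 3 / (8 * np.sqrt(np.pi))   # ~ 0.2116
```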
Here we encounter a fundamental challenge: to choose the optimal bandwidth, we need to know $R(f'')$—but $f$ is exactly what we're trying to estimate! This circularity is why bandwidth selection is non-trivial and has spawned an entire sub-literature of methods (plug-in, cross-validation, etc.) that we'll explore in the next section.
Visual intuition:
Think of density estimation as fitting a flexible sheet to a surface:
Smooth density (low $R(f'')$): Like fitting a sheet to gently rolling hills. The sheet doesn't need to bend sharply, so even if we use a somewhat stiff sheet (large bandwidth), we get a good fit.
Rough density (high $R(f'')$): Like fitting a sheet to a mountainous terrain with sharp peaks and valleys. We need a very flexible sheet (small bandwidth) to capture the features, but this flexibility makes the sheet more susceptible to noise (data scatter).
The sample size determines how much noise we have to contend with. More data means less noise, allowing us to use a more flexible sheet without being overwhelmed by random fluctuations.
So far, we've analyzed the global tradeoff (MISE over the entire domain). However, the bias-variance balance can vary significantly across the domain of the density.
Regional variation:
Recall the pointwise expressions:
$$\text{Bias}^2(x) \propto (f''(x))^2$$ $$\text{Var}(x) \propto f(x)$$
This means:
In high-density regions: Variance is larger (more data, but also more variability), bias depends on local curvature.
In low-density regions (tails): Variance is smaller in absolute terms but potentially larger relative to the true density. Bias can be significant if the density is changing shape.
Near modes: Often high curvature ($f''$ large) means higher bias; also high density means higher variance.
In valleys between modes: Curvature can be high (concave up), causing bias; density is low, so variance is lower.
The inadequacy of global bandwidth:
A single global bandwidth cannot optimally balance the tradeoff everywhere. The AMISE-optimal bandwidth is a compromise: it may undersmooth in some regions and oversmooth in others.
This observation motivates variable bandwidth methods (covered in a later section) where $h$ adapts to local conditions.
In the tails of the distribution, both problems can occur simultaneously: (1) Few data points means high relative variance, and (2) The density may have high curvature as it decays. Global bandwidth methods often perform poorly in tails, either producing spurious bumps (undersmoothing) or missing the true tail decay rate (oversmoothing).
Integrated vs. pointwise optimization:
| Approach | Optimizes | Good for |
|---|---|---|
| Global MISE-optimal $h$ | Overall average performance | General-purpose estimation |
| Local pointwise optimal $h(x)$ | Performance at specific location | Understanding local behavior |
| Variable bandwidth $h(x_i)$ | Adapts to local data density | Densities with varying complexity |
| Adaptive estimation | Compromise between local adaptation and stability | Difficult densities |
In practice, most applications use a global bandwidth for simplicity, but awareness of local tradeoffs helps interpret estimates critically—especially in regions of high curvature or low density.
The theoretical analysis of bias-variance tradeoff has immediate practical implications for how we approach density estimation.
Key practical lessons:
Don't expect perfection: The nonparametric rate $n^{-4/5}$ is fundamental; no bandwidth selection method can do better asymptotically. Accept that estimates are approximations.
Sample size matters a lot: Because of the $n^{-1/5}$ scaling, you need substantial sample sizes for accurate estimation. With $n = 50$, expect noticeable errors; with $n = 5000$, expect reasonably accurate estimates.
Undersmoothing is often worse than oversmoothing: Excessive variance (undersmoothing) produces misleading spurious features; excessive bias (oversmoothing) may miss features but is less misleading. When in doubt, err slightly on the side of smoothing.
Visual inspection is essential: Plot the estimate with several bandwidth values. If conclusions change dramatically with bandwidth, they're not robust and more data is needed.
Context matters for bandwidth choice: The 'optimal' bandwidth depends on what question you're answering. For finding modes, smaller bandwidths may be better. For overall shape, larger bandwidths may suffice.
We've established the theoretical foundation of the bias-variance tradeoff. In the next section, we'll dive into the practical methods for actually selecting the bandwidth—turning this theory into actionable procedures that work with real data.
The bias-variance tradeoff is the theoretical bedrock upon which all bandwidth selection methods are built. Let's consolidate the key insights:
You now understand why bandwidth selection is challenging—the optimal bandwidth depends on the unknown density we're trying to estimate. The next page will explore the practical methods for solving this problem: rules of thumb, plug-in estimators, and cross-validation approaches that have become the standard tools for KDE practitioners.