If kernel density estimation were a musical instrument, bandwidth would be its tuning—get it right, and the estimate sings with clarity and precision; get it wrong, and the result is either screeching noise (undersmoothing) or muffled silence (oversmoothing).
We established in the previous page that bandwidth profoundly impacts KDE quality: too small produces spiky, overfitted estimates; too large obliterates important features. The optimal bandwidth depends on the unknown true density $f$—creating a fundamental circular problem. How do we select a bandwidth without already knowing what we're trying to estimate?
This page is devoted entirely to answering this question. We'll explore multiple philosophies and methods: quick rule-of-thumb calculations, sophisticated plug-in estimators, principled cross-validation approaches, and adaptive techniques that vary bandwidth locally. Each method embodies different assumptions and tradeoffs, and understanding this landscape is essential for applying KDE effectively in practice.
By the end of this page, you will master multiple bandwidth selection strategies including Silverman's rule, Scott's rule, plug-in estimators, least-squares cross-validation, likelihood cross-validation, and adaptive bandwidth methods. You'll understand when each approach excels, their computational costs, and practical guidance for choosing between them.
Recall the fundamental result for optimal bandwidth. Minimizing the Mean Integrated Squared Error (MISE) yields:
$$h^* = \left(\frac{R(K)}{n [\mu_2(K)]^2 R(f'')}\right)^{1/5}$$
where $R(K) = \int K(u)^2\,du$ is the kernel roughness, $\mu_2(K) = \int u^2 K(u)\,du$ is the kernel's second moment, and $R(f'') = \int [f''(x)]^2\,dx$ measures the curvature of the true density.
The problem is clear: the optimal bandwidth depends on $R(f'')$, which involves the unknown density $f$. This is not merely inconvenient—it's a fundamental epistemological challenge. Different approaches to bandwidth selection correspond to different strategies for breaking this circularity.
| Strategy | Approach | Assumptions | Complexity |
|---|---|---|---|
| Reference/Rule-of-Thumb | Assume f is from known family | Usually Gaussian | O(n) |
| Plug-in Methods | Estimate R(f'') iteratively | Smoothness of f | O(n log n) |
| Cross-Validation | Minimize estimation error estimate | None on f shape | O(n² log n) |
| Bayesian | Place prior on h, compute posterior | Prior specification | Variable |
| Adaptive | Let h vary with x | Local behavior differs | O(n²) |
Evaluation Criteria for Bandwidth Selectors:
How do we judge a bandwidth selector? Key criteria include:
Accuracy: How close is selected $\hat{h}$ to optimal $h^*$?
Consistency: Does $\hat{h}/h^* \to 1$ as $n \to \infty$?
Convergence Rate: How fast does $\hat{h}$ converge to $h^*$?
Robustness: Performance across different true densities?
Computational Cost: Time complexity as function of $n$?
Simplicity: Ease of implementation and understanding?
No method dominates on all criteria—hence the proliferation of approaches.
Reference rules assume the true density belongs to a specific family (typically Gaussian), compute the optimal bandwidth for that reference, and substitute sample estimates. They're simple, fast, and often sufficient for exploratory analysis.
Silverman's Rule (1986):
For a Gaussian density $f = N(\mu, \sigma^2)$: $$R(f'') = \frac{3}{8\sqrt{\pi}\sigma^5}$$
With Gaussian kernel ($R(K) = 1/(2\sqrt{\pi})$, $\mu_2(K) = 1$):
$$h^*_{\text{Gauss}} = \left(\frac{4\sigma^5}{3n}\right)^{1/5} = \left(\frac{4}{3n}\right)^{1/5} \sigma \approx 1.06 \cdot \sigma \cdot n^{-1/5}$$
Robust Version (Default in most software): $$h = 0.9 \cdot \min\left(\hat{\sigma}, \frac{\text{IQR}}{1.34}\right) \cdot n^{-1/5}$$
The factor 0.9 adjusts for using the minimum of two scale estimates (slightly lower than 1.06 to prevent oversmoothing). The IQR/1.34 equals $\sigma$ for Gaussian but is robust to outliers.
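As a minimal sketch (assuming the data arrive as a NumPy array of samples), the robust rule is only a few lines:

```python
import numpy as np

def silverman_robust(data):
    """Silverman's robust rule: h = 0.9 * min(sigma, IQR/1.34) * n^(-1/5)."""
    data = np.asarray(data)
    n = len(data)
    sigma = np.std(data, ddof=1)  # sample standard deviation
    iqr = np.percentile(data, 75) - np.percentile(data, 25)
    return 0.9 * min(sigma, iqr / 1.34) * n ** (-0.2)
```

For outlier-contaminated data the IQR-based term keeps the bandwidth from inflating along with the sample standard deviation.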
Scott's Rule (1992):
$$h = 1.059 \cdot \hat{\sigma} \cdot n^{-1/5}$$
Essentially the same as Silverman's non-robust version. Sometimes used with the coefficient rounded to 1.06.
Comparison of Coefficients:

For a Gaussian kernel with a Gaussian reference, both rules reduce to $h \approx 1.06 \cdot \hat{\sigma} \cdot n^{-1/5}$; Silverman's robust variant lowers the constant to 0.9 and swaps in the more outlier-resistant scale estimate.

For other kernels:
The coefficient changes based on $R(K)$ and $\mu_2(K)$. For Epanechnikov kernel: $$h = 2.34 \cdot \hat{\sigma} \cdot n^{-1/5}$$
(The larger coefficient reflects the kernel's shape, not that it requires wider bandwidth.)
Reference rules assume unimodality and approximate symmetry. For multimodal densities, they systematically oversmooth—merging separate modes into one. For heavily skewed or heavy-tailed distributions, they may also perform poorly. Always visualize results with multiple bandwidths when the data may not be Gaussian-like.
Oversmoothing Rule:

Terrell (1990) proposed a maximal-smoothing reference to guard against undersmoothing: $$h_{\text{os}} = \left(\frac{243\, R(K)}{35\, [\mu_2(K)]^2\, n}\right)^{1/5} \hat{\sigma}$$
This is designed to be larger than optimal, ensuring that features aren't spuriously created. One can then reduce from this starting point.
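For a Gaussian kernel ($R(K) = 1/(2\sqrt{\pi})$, $\mu_2(K) = 1$) the oversmoothed constant works out to roughly 1.144, slightly above Silverman's 1.06. A sketch (function name is illustrative):

```python
import numpy as np

def oversmoothed_bandwidth(data):
    """Maximal-smoothing bandwidth for a Gaussian kernel: ~1.144 * sigma * n^(-1/5)."""
    data = np.asarray(data)
    n = len(data)
    sigma = np.std(data, ddof=1)
    RK, mu2 = 1 / (2 * np.sqrt(np.pi)), 1.0   # Gaussian-kernel constants
    coef = (243 * RK / (35 * mu2**2)) ** 0.2  # approximately 1.144
    return coef * sigma * n ** (-0.2)
```

Any bandwidth larger than this value is almost certainly oversmoothing, so it makes a safe upper end for a search grid.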
Multivariate Reference Rules:
For $d$-dimensional data with bandwidth matrix $H \propto \widehat{\Sigma}$:
Scott's Rule: $$H = n^{-2/(d+4)} \widehat{\Sigma}$$
Silverman's Rule: $$H = \left(\frac{4}{(d+2)n}\right)^{2/(d+4)} \widehat{\Sigma}$$
These become increasingly unreliable as $d$ grows, because the Gaussian reference becomes less representative of complex multivariate structure.
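A sketch of Scott's multivariate rule, assuming the data are an $n \times d$ array and returning the full bandwidth matrix from the sample covariance:

```python
import numpy as np

def scott_bandwidth_matrix(X):
    """Scott's multivariate rule: H = n^(-2/(d+4)) * sample covariance."""
    X = np.asarray(X)
    n, d = X.shape
    return n ** (-2 / (d + 4)) * np.cov(X, rowvar=False)
```

For $d = 1$ this reduces (on the scale of $h^2$) to the familiar $\hat{\sigma}^2 n^{-2/5}$.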
Plug-in methods estimate $R(f'') = \int [f''(x)]^2 dx$ directly from the data, then substitute ("plug in") this estimate into the optimal bandwidth formula. The key insight is that $R(f'')$ can itself be estimated by kernel methods.
The Idea:
To estimate $R(f'') = \int [f''(x)]^2 dx$, we can:
But the kernel estimator of $f''$ has its own bandwidth $g$, which itself depends on $R(f^{(4)})$. This creates a hierarchy of estimation problems.
Sheather-Jones Plug-in (1991):
The most sophisticated and widely-used plug-in method. It breaks the circularity by estimating $R(f'')$ with a kernel estimator whose pilot bandwidth is derived from an estimate of $R(f^{(4)})$, terminating the chain with a Gaussian reference at the highest-order derivative.
Mathematical Details:
To estimate $\psi_r = \int f^{(r)}(x)^2 dx$ (where $f^{(r)}$ is the $r$-th derivative), use:
$$\hat{\psi}_r = \frac{1}{n^2 g^{r+1}} \sum_{i=1}^n \sum_{j=1}^n K^{(r)}\left(\frac{x_i - x_j}{g}\right)$$
where $K^{(r)}$ is the $r$-th derivative of the kernel.
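As a sketch, the double sum above can be transcribed directly for $r = 2$ with the Gaussian kernel, whose second derivative is $K''(u) = (u^2 - 1)\varphi(u)$; the pilot bandwidth $g$ is taken as given here, and the function name is illustrative:

```python
import numpy as np

def psi_hat(data, g, r=2):
    """Transcription of the psi_r double-sum estimator for r = 2 (Gaussian kernel)."""
    data = np.asarray(data)
    n = len(data)
    u = (data[:, None] - data[None, :]) / g          # all pairwise scaled differences
    phi = np.exp(-u**2 / 2) / np.sqrt(2 * np.pi)     # standard normal density
    K2 = (u**2 - 1) * phi                            # second derivative of the kernel
    return K2.sum() / (n**2 * g ** (r + 1))
```

The O(n²) pairwise sum is the computational bottleneck of plug-in methods; binned or FFT-based approximations are used in practice for large $n$.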
The bandwidth $g$ for estimating $\psi_r$ depends on $\psi_{r+2}$. This creates the hierarchy:
The AMSE-optimal pilot bandwidth for estimating $\psi_r$ has the form $g_r = C_r(K)\, \psi_{r+2}^{-1/(r+3)}\, n^{-1/(r+3)}$, giving the chain:
$$g_4 = C_4(K) \cdot \psi_6^{-1/7} \cdot n^{-1/7}$$ $$g_2 = C_2(K) \cdot \psi_4^{-1/5} \cdot n^{-1/5}$$ $$h = C_0(K) \cdot \psi_2^{-1/5} \cdot n^{-1/5}$$
where $C_r(K)$ are constants depending on the kernel.
The chain is stopped by using a Gaussian reference for the highest derivative.
Sheather-Jones is implemented in standard statistical software (e.g., stats::bw.SJ in R, or the 'ISJ' rule in the Python KDEpy package; note that scipy.stats.gaussian_kde only offers the Scott and Silverman reference rules). It generally outperforms simple reference rules, especially for non-normal densities, and is widely regarded as the benchmark among plug-in selectors.
Properties of Plug-in Methods:
Advantages:

- Fast relative convergence of $\hat{h}/h^*$ (order $n^{-5/14}$ for Sheather-Jones, versus $n^{-1/10}$ for least-squares cross-validation)
- Much more stable than cross-validation across repeated samples
- Computationally cheap

Disadvantages:

- Requires extra smoothness of $f$ (existence of higher-order derivatives)
- The Gaussian reference at the top of the chain reintroduces a weakened normality assumption, which can bias toward oversmoothing for strongly non-normal densities
Direct Plug-in (DPI) Variant:
Simpler than Sheather-Jones: it stops the hierarchy after a fixed number of stages, plugging a normal-reference estimate into the highest-order functional rather than iterating further.
Slightly less accurate but easier to implement.
| Method | Number of Stages | Convergence Rate | Complexity |
|---|---|---|---|
| Normal Reference (0-stage) | 0 | Biased for non-normal | O(n) |
| Direct Plug-in (1-stage) | 1 | O(n⁻⁵/¹⁴) | O(n log n) |
| Sheather-Jones (2-stage) | 2 | O(n⁻⁵/¹⁴) | O(n log n) |
| 3-stage | 3 | O(n⁻⁵/¹⁴) | O(n log n) |
Cross-validation (CV) methods select bandwidth by optimizing an estimate of the density estimator's error that doesn't require knowing the true density. Unlike plug-in methods, CV makes no assumptions about the shape of $f$—it is fully nonparametric.
Least-Squares Cross-Validation (LSCV):
Also called Unbiased Cross-Validation (UCV). We want to minimize the Integrated Squared Error: $$\text{ISE}(h) = \int [\hat{f}_h(x) - f(x)]^2 dx = \int \hat{f}_h^2 - 2\int \hat{f}_h f + \int f^2$$
The last term $\int f^2$ doesn't depend on $h$, so we minimize: $$J(h) = \int \hat{f}_h^2 - 2\int \hat{f}_h f$$
The first integral can be computed directly. The second term $\int \hat{f}_h f = \mathbb{E}_f[\hat{f}_h(X)]$ can be estimated by leave-one-out:
$$\widehat{\int \hat{f}_h f} = \frac{1}{n} \sum_{i=1}^n \hat{f}_{h,-i}(x_i)$$
where $\hat{f}_{h,-i}(x_i)$ is the KDE at $x_i$ using all points except $x_i$.
LSCV Score:
$$\text{LSCV}(h) = \int \hat{f}_h^2(x)\, dx - \frac{2}{n} \sum_{i=1}^n \hat{f}_{h,-i}(x_i)$$
For Gaussian kernel, both terms have closed-form expressions:
$$\int \hat{f}_h^2\, dx = \frac{1}{n^2 h \sqrt{4\pi}} \sum_{i=1}^n \sum_{j=1}^n \exp\left(-\frac{(x_i - x_j)^2}{4h^2}\right)$$
$$\hat{f}_{h,-i}(x_i) = \frac{1}{(n-1)h\sqrt{2\pi}} \sum_{j \neq i} \exp\left(-\frac{(x_i - x_j)^2}{2h^2}\right)$$
Minimize LSCV$(h)$ over a grid of $h$ values.
Properties of LSCV:

- Asymptotically unbiased for $\text{MISE}(h) - \int f^2$ (hence the name "unbiased CV") and consistent without shape assumptions on $f$
- Slow relative convergence ($\hat{h}/h^* - 1 = O_p(n^{-1/10})$) and high sampling variability; the LSCV curve often has multiple local minima, so a wide search grid is advisable
Biased Cross-Validation (BCV):
A variant that reduces the variability of LSCV at the cost of some bias:
$$\text{BCV}(h) = \int \hat{f}_h^2 + \frac{R(K)}{nh} - \frac{2}{n} \sum_{i=1}^n \hat{f}_{h,-i}(x_i) - \frac{2K(0)}{nh}$$
The additional terms stabilize the objective function, reducing variance but introducing small bias.
Likelihood Cross-Validation (LCV):
Instead of minimizing ISE, maximize the leave-one-out log-likelihood:
$$\text{LCV}(h) = \sum_{i=1}^n \log \hat{f}_{h,-i}(x_i)$$
This measures how well the density predicts held-out points—maximizing average log-density of truly sampled points.
Per-observation form: dividing by $n$ gives the average held-out log-density, $\frac{1}{n}\sum_{i=1}^n \log \hat{f}_{h,-i}(x_i)$, the cross-validated analogue of the expected log-likelihood.

Maximizing LCV is related to minimizing the KL divergence $D_{KL}(f \,\|\, \hat{f})$ rather than ISE.
LSCV targets L² error (ISE) while LCV targets KL divergence. For visualization, they often give similar results. For generative modeling (sampling from the density), LCV is more natural. Both can suffer from high variance; using wide search grids and smooth optimization helps.
How do the various bandwidth selection methods compare in practice? Extensive simulation studies (e.g., Park & Marron 1990, Jones et al. 1996) have evaluated methods across different density types.
| Method | Speed | Accuracy (Normal) | Accuracy (Bimodal) | Robustness |
|---|---|---|---|---|
| Silverman's Rule | Very Fast | Excellent | Poor (oversmooths) | High |
| Scott's Rule | Very Fast | Excellent | Poor (oversmooths) | Moderate |
| Sheather-Jones | Fast | Excellent | Good | Moderate |
| LSCV | Slow | Good | Good | Variable |
| LCV | Slow | Good | Good | Variable |
| BCV | Moderate | Good | Good | High |
Key Findings from Simulation Studies:
For Unimodal Densities: Reference rules and plug-in methods perform similarly well. Simpler is often better.
For Multimodal Densities: Cross-validation and plug-in methods outperform reference rules, which systematically oversmooth.
For Small Samples (n < 100): All methods have high variability. Reference rules are often as good as anything.
For Large Samples (n > 1000): Plug-in methods show their theoretical advantage in convergence rate.
Sheather-Jones Dominance: Across many scenarios, Sheather-Jones offers the best balance of speed and accuracy.
CV Variability: Cross-validation can be erratic, especially for small samples or smooth densities. Multiple minima in the CV curve cause problems.
Practical Recommendations:
Start with Silverman's robust rule for quick exploration. It's rarely catastrophic.
For publication/final analysis: Use Sheather-Jones (if available) or LSCV.
When you suspect multimodality: Cross-validation or plug-in; avoid reference rules.
When outliers are present: Use robust scale estimate (IQR-based) in reference rules.
When computational cost matters: Reference rules or plug-in (both are fast).
When you have no prior knowledge: Cross-validation makes the fewest assumptions.
Always validate visually: Plot multiple bandwidths spanning 0.5× to 2× the selected value. Features that appear across the range are robust.
No bandwidth selector is universally best. The 'right' choice depends on the goal (exploration vs inference), sample size, computational budget, and prior knowledge. When in doubt, try multiple methods and compare results. Agreement suggests robustness; disagreement warrants investigation.
Fixed-bandwidth KDE uses the same smoothing everywhere. But optimal smoothing may vary: high-density regions (modes) benefit from smaller bandwidth (capture detail), while low-density regions (tails) need larger bandwidth (reduce variance). Adaptive (variable) bandwidth methods address this.
Two Philosophies:
Sample-point adaptive: Each data point $x_i$ has its own bandwidth $h_i$. The kernel placed at $x_i$ has width $h_i$.
Design-point adaptive: The bandwidth at evaluation point $x$ depends on local density, $h(x) = h \cdot \lambda(x)$.
Both aim to achieve the same effect: narrower kernels where data is dense, wider kernels where data is sparse.
Abramson's Square-Root Rule (1982):
A widely-used adaptive approach. The bandwidth at point $x_i$ is: $$h_i = h \cdot \left(\frac{\bar{f}}{\tilde{f}(x_i)}\right)^{1/2}$$
where $h$ is a global (pilot) bandwidth, $\tilde{f}$ is a pilot density estimate (typically a fixed-bandwidth KDE), and $\bar{f}$ is the geometric mean of the pilot values $\tilde{f}(x_1), \dots, \tilde{f}(x_n)$.
Intuition: If $\tilde{f}(x_i)$ is high (dense region), then $h_i$ is small. If $\tilde{f}(x_i)$ is low (sparse region), then $h_i$ is large.
The adaptive KDE is: $$\hat{f}_{\text{adapt}}(x) = \frac{1}{n} \sum_{i=1}^n \frac{1}{h_i} K\left(\frac{x - x_i}{h_i}\right)$$
Normalization: The geometric mean $\bar{f}$ ensures that $\prod h_i$ remains approximately constant, preserving the overall smoothing level.
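The recipe above fits in one function. This is a minimal sketch (the function name and the choice of a fixed-bandwidth Gaussian pilot are illustrative, not prescribed by Abramson's paper):

```python
import numpy as np

def abramson_adaptive_kde(data, x_grid, h):
    """Adaptive KDE with Abramson's square-root rule (Gaussian kernel)."""
    data = np.asarray(data)
    n = len(data)
    # Pilot: fixed-bandwidth KDE evaluated at the data points themselves
    diff = data[:, None] - data[None, :]
    pilot = np.exp(-diff**2 / (2 * h**2)).sum(axis=1) / (n * h * np.sqrt(2 * np.pi))
    # Geometric mean of the pilot values normalizes the local factors
    gbar = np.exp(np.mean(np.log(pilot)))
    h_i = h * np.sqrt(gbar / pilot)  # local bandwidths: small at modes, large in tails
    # Final estimate: each point contributes a kernel of its own width
    u = (x_grid[:, None] - data[None, :]) / h_i[None, :]
    return (np.exp(-u**2 / 2) / (h_i * np.sqrt(2 * np.pi))).mean(axis=1)
```

Each kernel still integrates to one, so the adaptive estimate remains a valid density.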
Properties of Adaptive KDE:
Advantages:

- Smaller bias at sharp peaks and less noise in sparse tails than any single fixed bandwidth can achieve
- Abramson's square-root choice can reduce the order of the bias under sufficient smoothness

Disadvantages:

- Requires a pilot estimate, whose own bandwidth must be chosen
- Roughly doubles the computational cost and adds tuning complexity
- Theory is less complete, and results can be sensitive to the pilot
Balloon Estimator:
A design-point adaptive approach: $$\hat{f}_{\text{balloon}}(x) = \frac{k}{n \cdot V_d \cdot r_k(x)^d}$$
where $r_k(x)$ is the distance to the $k$-th nearest neighbor of $x$, and $V_d$ is the volume of a unit $d$-sphere. Larger $r_k$ (sparse region) gives smaller density. This is k-NN density estimation.
Sample-Smoothing Estimator:
A sample-point adaptive approach where $h_i \propto r_k(x_i)$ (distance to $k$-th nearest neighbor of $x_i$).
Adaptive bandwidth is most valuable when the density has sharp features (peaks) and long tails simultaneously—e.g., income distributions, insurance claims. For roughly Gaussian densities, the benefit is minimal. The added complexity is justified when fixed-bandwidth clearly fails to capture structure.
Armed with theoretical understanding, let's consolidate practical implementation guidance.
```python
import numpy as np
from scipy.optimize import minimize_scalar


def silverman_bandwidth(data):
    """Silverman's robust rule of thumb."""
    n = len(data)
    sigma = np.std(data, ddof=1)
    iqr = np.percentile(data, 75) - np.percentile(data, 25)
    scale = min(sigma, iqr / 1.34)
    return 0.9 * scale * n ** (-0.2)


def scott_bandwidth(data):
    """Scott's rule of thumb."""
    n = len(data)
    sigma = np.std(data, ddof=1)
    return 1.059 * sigma * n ** (-0.2)


def lscv_bandwidth(data, h_range=None):
    """
    Least-Squares Cross-Validation bandwidth selection.
    Returns bandwidth that minimizes estimated ISE.
    """
    data = np.asarray(data)
    n = len(data)
    # Default search range: 0.1 to 5 times Silverman's rule
    if h_range is None:
        h_silver = silverman_bandwidth(data)
        h_range = (0.1 * h_silver, 5 * h_silver)

    def lscv_score(h):
        """Compute LSCV score for bandwidth h."""
        if h <= 0:
            return np.inf
        # First term: integral of f_hat squared. For a Gaussian kernel:
        # sum_i sum_j exp(-(xi-xj)^2 / (4h^2)) / (n^2 * h * sqrt(4*pi))
        diff_matrix = data[:, np.newaxis] - data[np.newaxis, :]
        term1 = np.sum(np.exp(-diff_matrix**2 / (4 * h**2))) / (n**2 * h * np.sqrt(4 * np.pi))
        # Second term: 2 * mean of leave-one-out estimates (exclude i == j)
        np.fill_diagonal(diff_matrix, np.inf)  # exp(-inf) = 0 removes the self term
        loo_contributions = np.sum(np.exp(-diff_matrix**2 / (2 * h**2)), axis=1)
        term2 = 2 * np.mean(loo_contributions) / ((n - 1) * h * np.sqrt(2 * np.pi))
        return term1 - term2

    # Minimize over the range using bounded scalar optimization
    result = minimize_scalar(lscv_score, bounds=h_range, method='bounded')
    return result.x


def lcv_bandwidth(data, h_range=None):
    """
    Likelihood Cross-Validation bandwidth selection.
    Returns bandwidth that maximizes leave-one-out log-likelihood.
    """
    data = np.asarray(data)
    n = len(data)
    if h_range is None:
        h_silver = silverman_bandwidth(data)
        h_range = (0.1 * h_silver, 5 * h_silver)

    def neg_lcv_score(h):
        """Negative LCV (for minimization)."""
        if h <= 0:
            return np.inf
        diff_matrix = data[:, np.newaxis] - data[np.newaxis, :]
        kernel_values = np.exp(-diff_matrix**2 / (2 * h**2)) / (h * np.sqrt(2 * np.pi))
        np.fill_diagonal(kernel_values, 0)  # leave-one-out
        loo_densities = np.sum(kernel_values, axis=1) / (n - 1)
        loo_densities = np.maximum(loo_densities, 1e-300)  # avoid log(0)
        return -np.sum(np.log(loo_densities))

    result = minimize_scalar(neg_lcv_score, bounds=h_range, method='bounded')
    return result.x


# Example usage
np.random.seed(42)
# Bimodal data where Silverman will oversmooth
data = np.concatenate([np.random.normal(-2, 0.8, 200),
                       np.random.normal(2, 0.8, 200)])

h_silverman = silverman_bandwidth(data)
h_scott = scott_bandwidth(data)
h_lscv = lscv_bandwidth(data)
h_lcv = lcv_bandwidth(data)

print("Bandwidth Selection Comparison (Bimodal Data):")
print(f"  Silverman: {h_silverman:.4f}")
print(f"  Scott:     {h_scott:.4f}")
print(f"  LSCV:      {h_lscv:.4f}")
print(f"  LCV:       {h_lcv:.4f}")
```

Decision Flowchart:
```
Start
│
├── Is this exploratory analysis?
│     Yes → Use Silverman's Rule
│     No ↓
│
├── Do you suspect multiple modes?
│     Yes → Use Cross-Validation or Plug-in
│     No ↓
│
├── Is sample size > 500?
│     Yes → Sheather-Jones Plug-in
│     No → Silverman's Rule (robust)
│
└── Always: Visualize with multiple bandwidths!
```
Common Mistakes to Avoid:
Using default without checking: Software defaults are sensible but not always appropriate for your data.
Ignoring visual inspection: Numbers can't tell you if modes are real or artifacts.
Over-relying on CV: For small samples, CV can be very noisy.
Reporting single bandwidth: Always report which method was used and sensitivity to choice.
Ignoring domain knowledge: If you know the density should be unimodal (or bimodal, etc.), use that information.
Bandwidth selection is the critical practical challenge in kernel density estimation. We've explored the major approaches in depth: fast reference rules, accurate plug-in estimators, assumption-free cross-validation, and adaptive methods for densities whose scale varies across the support.
What's Next:
With univariate density estimation fully explored, we turn to the challenge of multivariate density estimation. The next page addresses the curse of dimensionality head-on, exploring techniques for estimating joint distributions and understanding how sample requirements scale with dimension.
You now have comprehensive mastery of bandwidth selection for kernel density estimation. From quick rule-of-thumb calculations to sophisticated plug-in and cross-validation methods, you can select bandwidths appropriately for any application and understand the tradeoffs involved.