If kernel density estimation were a musical instrument, bandwidth would be its tuning—get it right, and the estimate sings with clarity and precision; get it wrong, and the result is either screeching noise (undersmoothing) or muffled silence (oversmoothing).
We established in the previous page that bandwidth profoundly impacts KDE quality: too small produces spiky, overfitted estimates; too large obliterates important features. The optimal bandwidth depends on the unknown true density $f$—creating a fundamental circular problem. How do we select a bandwidth without already knowing what we're trying to estimate?
This page is devoted entirely to answering this question. We'll explore multiple philosophies and methods: quick rule-of-thumb calculations, sophisticated plug-in estimators, principled cross-validation approaches, and adaptive techniques that vary bandwidth locally. Each method embodies different assumptions and tradeoffs, and understanding this landscape is essential for applying KDE effectively in practice.
By the end of this page, you will master multiple bandwidth selection strategies including Silverman's rule, Scott's rule, plug-in estimators, least-squares cross-validation, likelihood cross-validation, and adaptive bandwidth methods. You'll understand when each approach excels, their computational costs, and practical guidance for choosing between them.
Recall the fundamental result for optimal bandwidth. Minimizing the Mean Integrated Squared Error (MISE) yields:
$$h^* = \left(\frac{R(K)}{n [\mu_2(K)]^2 R(f'')}\right)^{1/5}$$
where $R(K) = \int K(u)^2\,du$ is the kernel roughness, $\mu_2(K) = \int u^2 K(u)\,du$ is the kernel's second moment, and $R(f'') = \int [f''(x)]^2\,dx$ measures the curvature of the true density.
The problem is clear: the optimal bandwidth depends on $R(f'')$, which involves the unknown density $f$. This is not merely inconvenient—it's a fundamental epistemological challenge. Different approaches to bandwidth selection correspond to different strategies for breaking this circularity.
| Strategy | Approach | Assumptions | Complexity |
|---|---|---|---|
| Reference/Rule-of-Thumb | Assume f is from known family | Usually Gaussian | O(n) |
| Plug-in Methods | Estimate R(f'') iteratively | Smoothness of f | O(n log n) |
| Cross-Validation | Minimize estimation error estimate | None on f shape | O(n² log n) |
| Bayesian | Place prior on h, compute posterior | Prior specification | Variable |
| Adaptive | Let h vary with x | Local behavior differs | O(n²) |
Evaluation Criteria for Bandwidth Selectors:
How do we judge a bandwidth selector? Key criteria include:
Accuracy: How close is selected $\hat{h}$ to optimal $h^*$?
Consistency: Does $\hat{h}/h^* \to 1$ as $n \to \infty$?
Convergence Rate: How fast does $\hat{h}$ converge to $h^*$?
Robustness: Performance across different true densities?
Computational Cost: Time complexity as function of $n$?
Simplicity: Ease of implementation and understanding?
No method dominates on all criteria—hence the proliferation of approaches.
Reference rules assume the true density belongs to a specific family (typically Gaussian), compute the optimal bandwidth for that reference, and substitute sample estimates. They're simple, fast, and often sufficient for exploratory analysis.
Silverman's Rule (1986):
For a Gaussian density $f = N(\mu, \sigma^2)$: $$R(f'') = \frac{3}{8\sqrt{\pi}\sigma^5}$$
With Gaussian kernel ($R(K) = 1/(2\sqrt{\pi})$, $\mu_2(K) = 1$):
$$h^*_{\text{Gauss}} = \left(\frac{4\sigma^5}{3n}\right)^{1/5} = \left(\frac{4}{3n}\right)^{1/5} \sigma \approx 1.06 \cdot \sigma \cdot n^{-1/5}$$
Robust Version (Default in most software): $$h = 0.9 \cdot \min\left(\hat{\sigma}, \frac{\text{IQR}}{1.34}\right) \cdot n^{-1/5}$$
The factor 0.9 adjusts for using the minimum of two scale estimates (slightly lower than 1.06 to prevent oversmoothing). The IQR/1.34 equals $\sigma$ for Gaussian but is robust to outliers.
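As a minimal sketch (assuming the data arrive as a NumPy array of samples), the robust rule is only a few lines:

```python
import numpy as np

def silverman_robust(data):
    """Silverman's robust rule: h = 0.9 * min(sigma, IQR/1.34) * n^(-1/5)."""
    data = np.asarray(data)
    n = len(data)
    sigma = np.std(data, ddof=1)  # sample standard deviation
    iqr = np.percentile(data, 75) - np.percentile(data, 25)
    return 0.9 * min(sigma, iqr / 1.34) * n ** (-0.2)
```

For outlier-contaminated data the IQR-based term keeps the bandwidth from inflating along with the sample standard deviation.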
Scott's Rule (1992):
$$h = 1.059 \cdot \hat{\sigma} \cdot n^{-1/5}$$
Essentially the same as Silverman's non-robust version. Sometimes used with the coefficient rounded to 1.06.
Comparison of Coefficients:

For a Gaussian kernel with a Gaussian reference, both rules reduce to $h \approx 1.06 \cdot \hat{\sigma} \cdot n^{-1/5}$; Silverman's robust variant lowers the constant to 0.9 and swaps in the more outlier-resistant scale estimate.

For other kernels:
The coefficient changes based on $R(K)$ and $\mu_2(K)$. For Epanechnikov kernel: $$h = 2.34 \cdot \hat{\sigma} \cdot n^{-1/5}$$
(The larger coefficient reflects the kernel's shape, not that it requires wider bandwidth.)
Reference rules assume unimodality and approximate symmetry. For multimodal densities, they systematically oversmooth—merging separate modes into one. For heavily skewed or heavy-tailed distributions, they may also perform poorly. Always visualize results with multiple bandwidths when the data may not be Gaussian-like.
Oversmoothing Rule:

Terrell (1990) proposed a maximal-smoothing reference to guard against undersmoothing: $$h_{\text{os}} = \left(\frac{243\, R(K)}{35\, [\mu_2(K)]^2\, n}\right)^{1/5} \hat{\sigma}$$
This is designed to be larger than optimal, ensuring that features aren't spuriously created. One can then reduce from this starting point.
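For a Gaussian kernel ($R(K) = 1/(2\sqrt{\pi})$, $\mu_2(K) = 1$) the oversmoothed constant works out to roughly 1.144, slightly above Silverman's 1.06. A sketch (function name is illustrative):

```python
import numpy as np

def oversmoothed_bandwidth(data):
    """Maximal-smoothing bandwidth for a Gaussian kernel: ~1.144 * sigma * n^(-1/5)."""
    data = np.asarray(data)
    n = len(data)
    sigma = np.std(data, ddof=1)
    RK, mu2 = 1 / (2 * np.sqrt(np.pi)), 1.0   # Gaussian-kernel constants
    coef = (243 * RK / (35 * mu2**2)) ** 0.2  # approximately 1.144
    return coef * sigma * n ** (-0.2)
```

Any bandwidth larger than this value is almost certainly oversmoothing, so it makes a safe upper end for a search grid.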
Multivariate Reference Rules:
For $d$-dimensional data with bandwidth matrix $H \propto \widehat{\Sigma}$:
Scott's Rule: $$H = n^{-2/(d+4)} \widehat{\Sigma}$$
Silverman's Rule: $$H = \left(\frac{4}{(d+2)n}\right)^{2/(d+4)} \widehat{\Sigma}$$
These become increasingly unreliable as $d$ grows, because the Gaussian reference becomes less representative of complex multivariate structure.
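A sketch of Scott's multivariate rule, assuming the data are an $n \times d$ array and returning the full bandwidth matrix from the sample covariance:

```python
import numpy as np

def scott_bandwidth_matrix(X):
    """Scott's multivariate rule: H = n^(-2/(d+4)) * sample covariance."""
    X = np.asarray(X)
    n, d = X.shape
    return n ** (-2 / (d + 4)) * np.cov(X, rowvar=False)
```

For $d = 1$ this reduces (on the scale of $h^2$) to the familiar $\hat{\sigma}^2 n^{-2/5}$.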
Plug-in methods estimate $R(f'') = \int [f''(x)]^2 dx$ directly from the data, then substitute ("plug in") this estimate into the optimal bandwidth formula. The key insight is that $R(f'')$ can itself be estimated by kernel methods.
The Idea:
To estimate $R(f'') = \int [f''(x)]^2 dx$, we can:
But the kernel estimator of $f''$ has its own bandwidth $g$, which itself depends on $R(f^{(4)})$. This creates a hierarchy of estimation problems.
Sheather-Jones Plug-in (1991):
The most sophisticated and widely-used plug-in method. It breaks the circularity by estimating $R(f'')$ with a kernel estimator whose pilot bandwidth is derived from an estimate of $R(f^{(4)})$, terminating the chain with a Gaussian reference at the highest-order derivative.
Mathematical Details:
To estimate $\psi_r = \int f^{(r)}(x)^2 dx$ (where $f^{(r)}$ is the $r$-th derivative), use:
$$\hat{\psi}_r = \frac{1}{n^2 g^{r+1}} \sum_{i=1}^n \sum_{j=1}^n K^{(r)}\left(\frac{x_i - x_j}{g}\right)$$
where $K^{(r)}$ is the $r$-th derivative of the kernel.
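As a sketch, the double sum above can be transcribed directly for $r = 2$ with the Gaussian kernel, whose second derivative is $K''(u) = (u^2 - 1)\varphi(u)$; the pilot bandwidth $g$ is taken as given here, and the function name is illustrative:

```python
import numpy as np

def psi_hat(data, g, r=2):
    """Transcription of the psi_r double-sum estimator for r = 2 (Gaussian kernel)."""
    data = np.asarray(data)
    n = len(data)
    u = (data[:, None] - data[None, :]) / g          # all pairwise scaled differences
    phi = np.exp(-u**2 / 2) / np.sqrt(2 * np.pi)     # standard normal density
    K2 = (u**2 - 1) * phi                            # second derivative of the kernel
    return K2.sum() / (n**2 * g ** (r + 1))
```

The O(n²) pairwise sum is the computational bottleneck of plug-in methods; binned or FFT-based approximations are used in practice for large $n$.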
The bandwidth $g$ for estimating $\psi_r$ depends on $\psi_{r+2}$. This creates the hierarchy:
The AMSE-optimal pilot bandwidth for estimating $\psi_r$ has the form $g_r = C_r(K)\, \psi_{r+2}^{-1/(r+3)}\, n^{-1/(r+3)}$, giving the chain:
$$g_4 = C_4(K) \cdot \psi_6^{-1/7} \cdot n^{-1/7}$$ $$g_2 = C_2(K) \cdot \psi_4^{-1/5} \cdot n^{-1/5}$$ $$h = C_0(K) \cdot \psi_2^{-1/5} \cdot n^{-1/5}$$
where $C_r(K)$ are constants depending on the kernel.
The chain is stopped by using a Gaussian reference for the highest derivative.
Sheather-Jones is implemented in standard statistical software (e.g., stats::bw.SJ in R, or the 'ISJ' rule in the Python KDEpy package; note that scipy.stats.gaussian_kde only offers the Scott and Silverman reference rules). It generally outperforms simple reference rules, especially for non-normal densities, and is widely regarded as the benchmark among plug-in selectors.
Properties of Plug-in Methods:
Advantages:

- Fast relative convergence of $\hat{h}/h^*$ (order $n^{-5/14}$ for Sheather-Jones, versus $n^{-1/10}$ for least-squares cross-validation)
- Much more stable than cross-validation across repeated samples
- Computationally cheap

Disadvantages:

- Requires extra smoothness of $f$ (existence of higher-order derivatives)
- The Gaussian reference at the top of the chain reintroduces a weakened normality assumption, which can bias toward oversmoothing for strongly non-normal densities
Direct Plug-in (DPI) Variant:
Simpler than Sheather-Jones: it stops the hierarchy after a fixed number of stages, plugging a normal-reference estimate into the highest-order functional rather than iterating further.
Slightly less accurate but easier to implement.
| Method | Number of Stages | Convergence Rate | Complexity |
|---|---|---|---|
| Normal Reference (0-stage) | 0 | Biased for non-normal | O(n) |
| Direct Plug-in (1-stage) | 1 | O(n⁻⁵/¹⁴) | O(n log n) |
| Sheather-Jones (2-stage) | 2 | O(n⁻⁵/¹⁴) | O(n log n) |
| 3-stage | 3 | O(n⁻⁵/¹⁴) | O(n log n) |
Cross-validation (CV) methods select bandwidth by optimizing an estimate of the density estimator's error that doesn't require knowing the true density. Unlike plug-in methods, CV makes no assumptions about the shape of $f$—it is fully nonparametric.
Least-Squares Cross-Validation (LSCV):
Also called Unbiased Cross-Validation (UCV). We want to minimize the Integrated Squared Error: $$\text{ISE}(h) = \int [\hat{f}_h(x) - f(x)]^2 dx = \int \hat{f}_h^2 - 2\int \hat{f}_h f + \int f^2$$
The last term $\int f^2$ doesn't depend on $h$, so we minimize: $$J(h) = \int \hat{f}_h^2 - 2\int \hat{f}_h f$$
The first integral can be computed directly. The second term $\int \hat{f}_h f = \mathbb{E}_f[\hat{f}_h(X)]$ can be estimated by leave-one-out:
$$\widehat{\int \hat{f}_h f} = \frac{1}{n} \sum_{i=1}^n \hat{f}_{h,-i}(x_i)$$
where $\hat{f}_{h,-i}(x_i)$ is the KDE at $x_i$ using all points except $x_i$.
LSCV Score:
$$\text{LSCV}(h) = \int \hat{f}_h^2(x)\, dx - \frac{2}{n} \sum_{i=1}^n \hat{f}_{h,-i}(x_i)$$
For Gaussian kernel, both terms have closed-form expressions:
$$\int \hat{f}_h^2\, dx = \frac{1}{n^2 h \sqrt{4\pi}} \sum_{i=1}^n \sum_{j=1}^n \exp\left(-\frac{(x_i - x_j)^2}{4h^2}\right)$$
$$\hat{f}_{h,-i}(x_i) = \frac{1}{(n-1)h\sqrt{2\pi}} \sum_{j \neq i} \exp\left(-\frac{(x_i - x_j)^2}{2h^2}\right)$$
Minimize LSCV$(h)$ over a grid of $h$ values.
Properties of LSCV:

- Asymptotically unbiased for $\text{MISE}(h) - \int f^2$ (hence the name "unbiased CV") and consistent without shape assumptions on $f$
- Slow relative convergence ($\hat{h}/h^* - 1 = O_p(n^{-1/10})$) and high sampling variability; the LSCV curve often has multiple local minima, so a wide search grid is advisable
Biased Cross-Validation (BCV):
A variant that reduces the variability of LSCV at the cost of some bias:
$$\text{BCV}(h) = \int \hat{f}_h^2 + \frac{R(K)}{nh} - \frac{2}{n} \sum_{i=1}^n \hat{f}_{h,-i}(x_i) - \frac{2K(0)}{nh}$$
The additional terms stabilize the objective function, reducing variance but introducing small bias.
Likelihood Cross-Validation (LCV):
Instead of minimizing ISE, maximize the leave-one-out log-likelihood:
$$\text{LCV}(h) = \sum_{i=1}^n \log \hat{f}_{h,-i}(x_i)$$
This measures how well the density predicts held-out points—maximizing average log-density of truly sampled points.
Per-observation form: dividing by $n$ gives the average held-out log-density, $\frac{1}{n}\sum_{i=1}^n \log \hat{f}_{h,-i}(x_i)$, the cross-validated analogue of the expected log-likelihood.

Maximizing LCV is related to minimizing the KL divergence $D_{KL}(f \,\|\, \hat{f})$ rather than ISE.
LSCV targets L² error (ISE) while LCV targets KL divergence. For visualization, they often give similar results. For generative modeling (sampling from the density), LCV is more natural. Both can suffer from high variance; using wide search grids and smooth optimization helps.
How do the various bandwidth selection methods compare in practice? Extensive simulation studies (e.g., Park & Marron 1990, Jones et al. 1996) have evaluated methods across different density types.
| Method | Speed | Accuracy (Normal) | Accuracy (Bimodal) | Robustness |
|---|---|---|---|---|
| Silverman's Rule | Very Fast | Excellent | Poor (oversmooths) | High |
| Scott's Rule | Very Fast | Excellent | Poor (oversmooths) | Moderate |
| Sheather-Jones | Fast | Excellent | Good | Moderate |
| LSCV | Slow | Good | Good | Variable |
| LCV | Slow | Good | Good | Variable |
| BCV | Moderate | Good | Good | High |
Key Findings from Simulation Studies:
For Unimodal Densities: Reference rules and plug-in methods perform similarly well. Simpler is often better.
For Multimodal Densities: Cross-validation and plug-in methods outperform reference rules, which systematically oversmooth.
For Small Samples (n < 100): All methods have high variability. Reference rules are often as good as anything.
For Large Samples (n > 1000): Plug-in methods show their theoretical advantage in convergence rate.
Sheather-Jones Dominance: Across many scenarios, Sheather-Jones offers the best balance of speed and accuracy.
CV Variability: Cross-validation can be erratic, especially for small samples or smooth densities. Multiple minima in the CV curve cause problems.
Practical Recommendations:
Start with Silverman's robust rule for quick exploration. It's rarely catastrophic.
For publication/final analysis: Use Sheather-Jones (if available) or LSCV.
When you suspect multimodality: Cross-validation or plug-in; avoid reference rules.
When outliers are present: Use robust scale estimate (IQR-based) in reference rules.
When computational cost matters: Reference rules or plug-in (both are fast).
When you have no prior knowledge: Cross-validation makes the fewest assumptions.
Always validate visually: Plot multiple bandwidths spanning 0.5× to 2× the selected value. Features that appear across the range are robust.
No bandwidth selector is universally best. The 'right' choice depends on the goal (exploration vs inference), sample size, computational budget, and prior knowledge. When in doubt, try multiple methods and compare results. Agreement suggests robustness; disagreement warrants investigation.
Fixed-bandwidth KDE uses the same smoothing everywhere. But optimal smoothing may vary: high-density regions (modes) benefit from smaller bandwidth (capture detail), while low-density regions (tails) need larger bandwidth (reduce variance). Adaptive (variable) bandwidth methods address this.
Two Philosophies:
Sample-point adaptive: Each data point $x_i$ has its own bandwidth $h_i$. The kernel placed at $x_i$ has width $h_i$.
Design-point adaptive: The bandwidth at evaluation point $x$ depends on local density, $h(x) = h \cdot \lambda(x)$.
Both aim to achieve the same effect: narrower kernels where data is dense, wider kernels where data is sparse.
Abramson's Square-Root Rule (1982):
A widely-used adaptive approach. The bandwidth at point $x_i$ is: $$h_i = h \cdot \left(\frac{\bar{f}}{\tilde{f}(x_i)}\right)^{1/2}$$
where $h$ is a global (pilot) bandwidth, $\tilde{f}$ is a pilot density estimate (typically a fixed-bandwidth KDE), and $\bar{f}$ is the geometric mean of the pilot values $\tilde{f}(x_1), \dots, \tilde{f}(x_n)$.
Intuition: If $\tilde{f}(x_i)$ is high (dense region), then $h_i$ is small. If $\tilde{f}(x_i)$ is low (sparse region), then $h_i$ is large.
The adaptive KDE is: $$\hat{f}_{\text{adapt}}(x) = \frac{1}{n} \sum_{i=1}^n \frac{1}{h_i} K\left(\frac{x - x_i}{h_i}\right)$$
Normalization: The geometric mean $\bar{f}$ ensures that $\prod h_i$ remains approximately constant, preserving the overall smoothing level.
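The recipe above fits in one function. This is a minimal sketch (the function name and the choice of a fixed-bandwidth Gaussian pilot are illustrative, not prescribed by Abramson's paper):

```python
import numpy as np

def abramson_adaptive_kde(data, x_grid, h):
    """Adaptive KDE with Abramson's square-root rule (Gaussian kernel)."""
    data = np.asarray(data)
    n = len(data)
    # Pilot: fixed-bandwidth KDE evaluated at the data points themselves
    diff = data[:, None] - data[None, :]
    pilot = np.exp(-diff**2 / (2 * h**2)).sum(axis=1) / (n * h * np.sqrt(2 * np.pi))
    # Geometric mean of the pilot values normalizes the local factors
    gbar = np.exp(np.mean(np.log(pilot)))
    h_i = h * np.sqrt(gbar / pilot)  # local bandwidths: small at modes, large in tails
    # Final estimate: each point contributes a kernel of its own width
    u = (x_grid[:, None] - data[None, :]) / h_i[None, :]
    return (np.exp(-u**2 / 2) / (h_i * np.sqrt(2 * np.pi))).mean(axis=1)
```

Each kernel still integrates to one, so the adaptive estimate remains a valid density.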
Properties of Adaptive KDE:
Advantages:

- Smaller bias at sharp peaks and less noise in sparse tails than any single fixed bandwidth can achieve
- Abramson's square-root choice can reduce the order of the bias under sufficient smoothness

Disadvantages:

- Requires a pilot estimate, whose own bandwidth must be chosen
- Roughly doubles the computational cost and adds tuning complexity
- Theory is less complete, and results can be sensitive to the pilot
Balloon Estimator:
A design-point adaptive approach: $$\hat{f}_{\text{balloon}}(x) = \frac{k}{n \cdot V_d \cdot r_k(x)^d}$$
where $r_k(x)$ is the distance to the $k$-th nearest neighbor of $x$, and $V_d$ is the volume of a unit $d$-sphere. Larger $r_k$ (sparse region) gives smaller density. This is k-NN density estimation.
Sample-Smoothing Estimator:
A sample-point adaptive approach where $h_i \propto r_k(x_i)$ (distance to $k$-th nearest neighbor of $x_i$).
Adaptive bandwidth is most valuable when the density has sharp features (peaks) and long tails simultaneously—e.g., income distributions, insurance claims. For roughly Gaussian densities, the benefit is minimal. The added complexity is justified when fixed-bandwidth clearly fails to capture structure.
Armed with theoretical understanding, let's consolidate practical implementation guidance.
```python
import numpy as np
from scipy.optimize import minimize_scalar


def silverman_bandwidth(data):
    """Silverman's robust rule of thumb."""
    n = len(data)
    sigma = np.std(data, ddof=1)
    iqr = np.percentile(data, 75) - np.percentile(data, 25)
    scale = min(sigma, iqr / 1.34)
    return 0.9 * scale * n ** (-0.2)


def scott_bandwidth(data):
    """Scott's rule of thumb."""
    n = len(data)
    sigma = np.std(data, ddof=1)
    return 1.059 * sigma * n ** (-0.2)


def lscv_bandwidth(data, h_range=None):
    """
    Least-Squares Cross-Validation bandwidth selection.
    Returns bandwidth that minimizes estimated ISE.
    """
    data = np.asarray(data)
    n = len(data)
    # Default search range: 0.1 to 5 times Silverman's rule
    if h_range is None:
        h_silver = silverman_bandwidth(data)
        h_range = (0.1 * h_silver, 5 * h_silver)

    def lscv_score(h):
        """Compute LSCV score for bandwidth h."""
        if h <= 0:
            return np.inf
        # First term: integral of f_hat squared. For a Gaussian kernel:
        # sum_i sum_j exp(-(xi-xj)^2 / (4h^2)) / (n^2 * h * sqrt(4*pi))
        diff_matrix = data[:, np.newaxis] - data[np.newaxis, :]
        term1 = np.sum(np.exp(-diff_matrix**2 / (4 * h**2))) / (n**2 * h * np.sqrt(4 * np.pi))
        # Second term: 2 * mean of leave-one-out estimates (exclude i == j)
        np.fill_diagonal(diff_matrix, np.inf)  # exp(-inf) = 0 removes the self term
        loo_contributions = np.sum(np.exp(-diff_matrix**2 / (2 * h**2)), axis=1)
        term2 = 2 * np.mean(loo_contributions) / ((n - 1) * h * np.sqrt(2 * np.pi))
        return term1 - term2

    # Minimize over the range using bounded scalar optimization
    result = minimize_scalar(lscv_score, bounds=h_range, method='bounded')
    return result.x


def lcv_bandwidth(data, h_range=None):
    """
    Likelihood Cross-Validation bandwidth selection.
    Returns bandwidth that maximizes leave-one-out log-likelihood.
    """
    data = np.asarray(data)
    n = len(data)
    if h_range is None:
        h_silver = silverman_bandwidth(data)
        h_range = (0.1 * h_silver, 5 * h_silver)

    def neg_lcv_score(h):
        """Negative LCV (for minimization)."""
        if h <= 0:
            return np.inf
        diff_matrix = data[:, np.newaxis] - data[np.newaxis, :]
        kernel_values = np.exp(-diff_matrix**2 / (2 * h**2)) / (h * np.sqrt(2 * np.pi))
        np.fill_diagonal(kernel_values, 0)  # leave-one-out
        loo_densities = np.sum(kernel_values, axis=1) / (n - 1)
        loo_densities = np.maximum(loo_densities, 1e-300)  # avoid log(0)
        return -np.sum(np.log(loo_densities))

    result = minimize_scalar(neg_lcv_score, bounds=h_range, method='bounded')
    return result.x


# Example usage
np.random.seed(42)
# Bimodal data where Silverman will oversmooth
data = np.concatenate([np.random.normal(-2, 0.8, 200),
                       np.random.normal(2, 0.8, 200)])

h_silverman = silverman_bandwidth(data)
h_scott = scott_bandwidth(data)
h_lscv = lscv_bandwidth(data)
h_lcv = lcv_bandwidth(data)

print("Bandwidth Selection Comparison (Bimodal Data):")
print(f"  Silverman: {h_silverman:.4f}")
print(f"  Scott:     {h_scott:.4f}")
print(f"  LSCV:      {h_lscv:.4f}")
print(f"  LCV:       {h_lcv:.4f}")
```

Decision Flowchart:
```
Start
│
├── Is this exploratory analysis?
│     Yes → Use Silverman's Rule
│     No ↓
│
├── Do you suspect multiple modes?
│     Yes → Use Cross-Validation or Plug-in
│     No ↓
│
├── Is sample size > 500?
│     Yes → Sheather-Jones Plug-in
│     No → Silverman's Rule (robust)
│
└── Always: Visualize with multiple bandwidths!
```
Common Mistakes to Avoid:
Using default without checking: Software defaults are sensible but not always appropriate for your data.
Ignoring visual inspection: Numbers can't tell you if modes are real or artifacts.
Over-relying on CV: For small samples, CV can be very noisy.
Reporting single bandwidth: Always report which method was used and sensitivity to choice.
Ignoring domain knowledge: If you know the density should be unimodal (or bimodal, etc.), use that information.
Bandwidth selection is the critical practical challenge in kernel density estimation. We've explored the major approaches in depth: fast reference rules, accurate plug-in estimators, assumption-free cross-validation, and adaptive methods for densities whose scale varies across the support.
What's Next:
With univariate density estimation fully explored, we turn to the challenge of multivariate density estimation. The next page addresses the curse of dimensionality head-on, exploring techniques for estimating joint distributions and understanding how sample requirements scale with dimension.
You now have comprehensive mastery of bandwidth selection for kernel density estimation. From quick rule-of-thumb calculations to sophisticated plug-in and cross-validation methods, you can select bandwidths appropriately for any application and understand the tradeoffs involved.