In the previous section, we established that the optimal bandwidth depends on $R(f'') = \int (f''(x))^2 dx$—a quantity that requires knowing the very density we're trying to estimate. This creates a fundamental circularity at the heart of kernel density estimation.
How do we break this circularity? Several ingenious approaches have been developed over decades of research. Each makes different assumptions and tradeoffs, and understanding them is essential for practitioners who want to move beyond default settings.
This page provides a comprehensive treatment of bandwidth selection methods, from simple rules of thumb to sophisticated data-driven techniques. We'll examine both the theory and practical implementation of each approach, giving you the tools to make informed choices for your specific applications.
By the end of this page, you will understand the reference rules of Silverman and Scott, plug-in methods that estimate $R(f'')$ from data, least-squares cross-validation (LSCV), biased cross-validation (BCV), the Sheather-Jones method (often considered the gold standard), and how to choose among these methods for different applications.
The simplest approach to bandwidth selection is to assume the underlying density belongs to a known family (typically Gaussian) and derive the optimal bandwidth for that reference distribution. These "rules of thumb" require only basic sample statistics and are computationally trivial.
1.1 Silverman's Rule of Thumb
Silverman (1986) derived what has become the most widely used reference rule. Assuming the true density is Gaussian and using the Gaussian kernel:
$$h_{\text{Silverman}} = \left(\frac{4\hat{\sigma}^5}{3n}\right)^{1/5} \approx 1.06 \cdot \hat{\sigma} \cdot n^{-1/5}$$
where $\hat{\sigma}$ is the sample standard deviation.
Derivation:
For a $N(\mu, \sigma^2)$ density: $$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$
$$f''(x) = \frac{1}{\sigma^3\sqrt{2\pi}} \left(\frac{(x-\mu)^2}{\sigma^2} - 1\right) \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$
Computing $R(f'') = \int (f''(x))^2 dx$ yields: $$R(f'') = \frac{3}{8\sigma^5\sqrt{\pi}}$$
Substituting into the AMISE-optimal bandwidth formula with the Gaussian kernel gives the result.
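Concretely, plugging the Gaussian-kernel constants $R(K) = 1/(2\sqrt{\pi})$ and $\mu_2(K) = 1$, together with $R(f'') = \frac{3}{8\sigma^5\sqrt{\pi}}$, into $h = \left(R(K) / (n\,\mu_2(K)^2 R(f''))\right)^{1/5}$ gives

$$h = \left(\frac{1/(2\sqrt{\pi})}{n \cdot \frac{3}{8\sigma^5\sqrt{\pi}}}\right)^{1/5} = \left(\frac{4\sigma^5}{3n}\right)^{1/5} \approx 1.06\,\sigma\, n^{-1/5}.$$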
Silverman also proposed a robust modification that uses the minimum of the standard deviation and a scaled interquartile range: $h = 0.9 \cdot \min\left(\hat{\sigma}, \frac{\text{IQR}}{1.34}\right) \cdot n^{-1/5}$. The factor 1.34 is the IQR of the standard normal. This provides protection against outliers and heavy tails.
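As a quick illustration, here is a minimal sketch of both forms of Silverman's rule (the function name silverman_rule is ours, not a library API):

```python
import numpy as np

def silverman_rule(x, robust=True):
    """Silverman's rule of thumb; the robust form uses min(std, IQR/1.34)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    sigma = np.std(x, ddof=1)
    if robust:
        iqr = np.percentile(x, 75) - np.percentile(x, 25)
        return 0.9 * min(sigma, iqr / 1.34) * n ** (-1 / 5)
    return 1.06 * sigma * n ** (-1 / 5)

rng = np.random.default_rng(0)
sample = rng.standard_t(df=3, size=1000)   # heavy-tailed example
print(f"standard: {silverman_rule(sample, robust=False):.4f}")
print(f"robust:   {silverman_rule(sample):.4f}")
```

On heavy-tailed data like this, the robust version typically returns a noticeably smaller bandwidth because the IQR is inflated by extreme values far less than the standard deviation is.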
1.2 Scott's Rule
The rule often quoted as Scott's rule,
$$h = 3.49 \cdot \hat{\sigma} \cdot n^{-1/3},$$
is actually Scott's bin-width rule for histograms (hence the $n^{-1/3}$ rate). For KDE, Scott (1992) proposed a normal-reference bandwidth that differs from Silverman's only in its constant; as implemented in scipy it is
$$h_{\text{Scott}} = \hat{\sigma} \cdot n^{-1/5}.$$
For KDE specifically, either normal-reference constant is reasonable, and the Silverman form is the more common choice.
1.3 Comparison of Reference Rules:
| Rule | Formula | $n^{-1/5}$ coefficient (using $\sigma$) |
|---|---|---|
| Silverman (standard) | $1.06 \sigma n^{-1/5}$ | 1.06 |
| Silverman (robust) | $0.9 \min(\sigma, \text{IQR}/1.34) n^{-1/5}$ | 0.9 |
| Scott (KDE, as in scipy) | $\sigma n^{-1/5}$ | 1.00 |
| Normal reference (exact) | $1.059 \sigma n^{-1/5}$ | 1.059 |
When do reference rules work well?
Reference rules work well when the true density is close to Gaussian: symmetric, unimodal, and without heavy tails. They tend to oversmooth multimodal or strongly skewed densities. The Gaussian density has nearly the smallest curvature functional $R(f'')$ among densities with a given variance, so any departure from Gaussianity typically means the truly optimal bandwidth is smaller than the reference rule suggests. For bimodal distributions the oversmoothing can be severe—merging two distinct modes into one broad hump.
Plug-in methods directly address the circularity problem by estimating the unknown functional $R(f'')$ from the data, then 'plugging in' this estimate to the optimal bandwidth formula.
The General Strategy:
But wait—there's more circularity!
To estimate $R(f'')$, we need to estimate $f''$. The natural approach is to use a KDE of a derivative, which itself requires a bandwidth. This creates a cascade of bandwidths, each requiring estimation of higher-order functionals.
2.1 Estimating Density Derivatives
The $r$-th derivative of a KDE can be estimated as:
$$\hat{f}^{(r)}_g(x) = \frac{1}{n g^{r+1}} \sum_{i=1}^{n} K^{(r)}\left(\frac{x - X_i}{g}\right)$$
where $K^{(r)}$ is the $r$-th derivative of the kernel and $g$ is a 'pilot' bandwidth.
For estimating $R(f'') = \int (f''(x))^2 dx$, we need the second derivative:
$$\hat{R}(f'') = \int (\hat{f}''_g(x))^2 dx$$
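A minimal sketch of these two estimators for the Gaussian kernel, where $K''(u) = \phi(u)(u^2 - 1)$; the function names are illustrative and the integral is approximated numerically on a grid:

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import trapezoid

def kde_second_derivative(x_grid, data, g):
    """Estimate f''(x) on a grid with a Gaussian kernel and pilot bandwidth g."""
    data = np.asarray(data, dtype=float)
    u = (np.asarray(x_grid)[:, None] - data[None, :]) / g
    K2 = norm.pdf(u) * (u**2 - 1)          # K''(u) for the Gaussian kernel
    return K2.sum(axis=1) / (len(data) * g**3)

def roughness_f2(data, g, grid_size=512):
    """Approximate R(f'') = int (f''(x))^2 dx by trapezoidal integration."""
    data = np.asarray(data, dtype=float)
    xs = np.linspace(data.min() - 4 * g, data.max() + 4 * g, grid_size)
    f2 = kde_second_derivative(xs, data, g)
    return trapezoid(f2**2, xs)

rng = np.random.default_rng(0)
sample = rng.normal(size=400)
print(f"R(f'') estimate: {roughness_f2(sample, g=0.5):.4f}")  # true value is about 0.212 for N(0,1)
```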
2.2 The Bandwidth Cascade
The optimal pilot bandwidth $g$ for estimating $R(f'')$ itself depends on the next higher-order functional $R(f''')$—the roughness of the third derivative. Estimating $R(f''')$ requires another pilot bandwidth, which depends on $R(f^{(4)})$, and so on.
Breaking the cascade:
Two-stage plug-in: Use the normal reference to estimate the highest-order functional needed, then work backwards.
Sheather-Jones (solve-the-equation): Formulate the problem as a fixed-point equation and solve iteratively.
2.3 Two-Stage Plug-in (Park and Marron)
A practical two-stage procedure:
Stage 1: Use the normal reference to estimate $R(f''')$, the functional needed by the pilot: $$\hat{R}(f''') = \frac{15}{16\sqrt{\pi}\hat{\sigma}^7}$$
Stage 2: Use this to compute the optimal pilot bandwidth for estimating $R(f'')$ (for the Gaussian kernel, $K^{(4)}(0) = 3\phi(0) > 0$): $$g = \left(\frac{2K^{(4)}(0)}{\mu_2(K)\, \hat{R}(f''')\, n}\right)^{1/7}$$
Stage 3: Estimate $R(f'')$ using pilot bandwidth $g$: $$\hat{R}(f'') = \frac{1}{n^2 g^5} \sum_{i=1}^{n} \sum_{j=1}^{n} K^{(4)}\left(\frac{X_i - X_j}{g}\right)$$
Stage 4: Compute final bandwidth: $$h = \left(\frac{R(K)}{n \mu_2(K)^2 \hat{R}(f'')}\right)^{1/5}$$
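A compact sketch of these four stages for the Gaussian kernel (the function name two_stage_plugin_bandwidth is ours; this is an illustration of the recipe above, not a reference implementation):

```python
import numpy as np
from scipy.stats import norm

def two_stage_plugin_bandwidth(data):
    """Two-stage direct plug-in bandwidth for a Gaussian-kernel KDE (sketch)."""
    x = np.asarray(data, dtype=float)
    n = len(x)
    sigma = np.std(x, ddof=1)

    # Gaussian-kernel constants
    R_K = 1 / (2 * np.sqrt(np.pi))
    mu2 = 1.0

    def phi4(u):
        # 4th derivative of the standard normal density
        return norm.pdf(u) * (u**4 - 6 * u**2 + 3)

    # Stage 1: normal reference for R(f''')
    R_f3 = 15 / (16 * np.sqrt(np.pi) * sigma**7)

    # Stage 2: pilot bandwidth for estimating R(f''); phi4(0) = K^(4)(0)
    g = (2 * phi4(0.0) / (mu2 * R_f3 * n)) ** (1 / 7)

    # Stage 3: double-sum estimate of R(f'') with pilot g
    diffs = (x[:, None] - x[None, :]) / g
    R_f2_hat = phi4(diffs).sum() / (n**2 * g**5)

    # Stage 4: plug into the AMISE-optimal bandwidth formula
    return (R_K / (n * mu2**2 * R_f2_hat)) ** (1 / 5)

rng = np.random.default_rng(0)
print(f"Two-stage plug-in bandwidth: {two_stage_plugin_bandwidth(rng.normal(size=500)):.4f}")
```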
Two-stage plug-in estimators achieve an asymptotic relative efficiency (ARE) that converges to 1—they achieve the optimal bandwidth asymptotically. The rate of convergence of $\hat{h}$ to $h_{\text{opt}}$ is $n^{-5/14}$ for well-designed two-stage methods. This is fast enough that the excess MISE from suboptimal bandwidth selection is asymptotically negligible.
The Sheather-Jones (SJ) method, published in 1991, is widely considered the gold standard for automatic bandwidth selection. It uses a 'solve-the-equation' approach that elegantly handles the cascade problem.
The Key Insight:
Instead of using a fixed pilot bandwidth, SJ recognizes that the optimal pilot bandwidth depends on the optimal final bandwidth (and vice versa). This leads to a fixed-point equation that can be solved iteratively.
Mathematical Formulation:
Define the AMISE-optimal bandwidth as a function of $R(f'')$: $$h^* = \left(\frac{R(K)}{n \mu_2(K)^2 R(f'')}\right)^{1/5}$$
The estimated $R(f'')$ depends on a pilot bandwidth $g$, which itself is a function of $h$ (through the optimal relationship for derivative estimation): $$g = c_1 \cdot h^{5/7}$$
Combining these gives an implicit equation for $h$: $$h = \left(\frac{R(K)}{n \mu_2(K)^2 \hat{R}(f''; c_1 h^{5/7})}\right)^{1/5}$$
This can be written as a fixed-point equation: $h = T(h)$, where $T$ encapsulates the right-hand side.
```python
import numpy as np
from scipy import optimize
from scipy.stats import norm


def sheather_jones_bandwidth(data):
    """
    Sheather-Jones 'solve-the-equation' bandwidth selector for a
    Gaussian-kernel KDE.

    Parameters
    ----------
    data : array-like
        Sample data for which to compute the optimal bandwidth.

    Returns
    -------
    float
        Bandwidth solving the Sheather-Jones fixed-point equation.
    """
    data = np.asarray(data, dtype=float)
    n = len(data)

    # Standardize the data for numerical stability
    std = np.std(data, ddof=1)
    data_std = (data - np.mean(data)) / std

    # Constants for the Gaussian kernel
    R_K = 1 / (2 * np.sqrt(np.pi))   # roughness R(K)
    mu_2_K = 1.0                     # second moment mu_2(K)

    def phi4(x):
        """4th derivative of the standard normal density."""
        return norm.pdf(x) * (x**4 - 6 * x**2 + 3)

    def phi6(x):
        """6th derivative of the standard normal density."""
        return norm.pdf(x) * (x**6 - 15 * x**4 + 45 * x**2 - 15)

    def psi_hat(r, g):
        """
        Double-sum estimate of psi_r = int f^(r)(x) f(x) dx with pilot
        bandwidth g.  For r = 4 this is the estimate of R(f'').
        """
        deriv = phi4 if r == 4 else phi6
        diffs = (data_std[:, None] - data_std[None, :]) / g
        return deriv(diffs).sum() / (n**2 * g**(r + 1))

    # Normal-reference pilot bandwidths for psi_4 and psi_6
    # (the standardized data has unit scale)
    a = 1.241 * n**(-1 / 7)
    b = 1.230 * n**(-1 / 9)
    psi4_a = psi_hat(4, a)
    psi6_b = psi_hat(6, b)

    # Pilot bandwidth as a function of h:  g(h) = c1 * h^(5/7).
    # In theory psi_4 > 0 and psi_6 < 0; absolute values guard against
    # noisy estimates turning the ratio negative.
    c1 = (2 * phi4(0.0) * mu_2_K * abs(psi4_a) / (R_K * abs(psi6_b))) ** (1 / 7)

    def sj_equation(h):
        """The fixed-point equation h - T(h); the SJ bandwidth is its root."""
        g = c1 * h**(5 / 7)
        R_f2_hat = abs(psi_hat(4, g))          # estimate of R(f'')
        h_target = (R_K / (n * mu_2_K**2 * R_f2_hat)) ** (1 / 5)
        return h - h_target

    # Initial guess from Silverman's rule (sigma = 1 after standardizing)
    h_init = 1.06 * n**(-1 / 5)

    # Solve the fixed-point equation
    try:
        h_opt = optimize.brentq(sj_equation, h_init / 10, h_init * 10)
    except ValueError:
        # Fall back to Silverman's rule if no root is bracketed
        h_opt = h_init

    # Rescale back to the original data scale
    return h_opt * std


# Example usage
np.random.seed(42)

# Unimodal data
normal_data = np.random.normal(0, 1, 500)
h_sj_normal = sheather_jones_bandwidth(normal_data)
print(f"Normal data - SJ bandwidth: {h_sj_normal:.4f}")
print(f"Normal data - Silverman bandwidth: "
      f"{1.06 * np.std(normal_data) * len(normal_data)**(-0.2):.4f}")

# Bimodal data (where SJ excels over the rule of thumb)
bimodal_data = np.concatenate([
    np.random.normal(-2, 0.5, 250),
    np.random.normal(2, 0.5, 250)
])
h_sj_bimodal = sheather_jones_bandwidth(bimodal_data)
print(f"\nBimodal data - SJ bandwidth: {h_sj_bimodal:.4f}")
print(f"Bimodal data - Silverman bandwidth: "
      f"{1.06 * np.std(bimodal_data) * len(bimodal_data)**(-0.2):.4f}")
```

The Sheather-Jones method is available in most statistical software. In R, use bw.SJ(). In Python, scipy.stats.gaussian_kde offers only Scott's and Silverman's rules, so use the KDEpy package (its 'ISJ' selector) or an implementation like the one above. In MATLAB, ksdensity accepts a custom bandwidth through its 'Bandwidth' option. SJ is generally preferred over simple rules of thumb whenever multimodality is suspected.
Cross-validation (CV) approaches select the bandwidth that minimizes some estimated prediction error. Unlike plug-in methods that estimate theoretical quantities, CV methods directly use the data to evaluate bandwidth quality.
4.1 Least-Squares Cross-Validation (LSCV)
LSCV, also known as Unbiased Cross-Validation (UCV), directly targets the ISE (Integrated Squared Error):
$$\text{ISE}(h) = \int (\hat{f}_h(x) - f(x))^2 dx = \int \hat{f}_h^2(x) dx - 2\int \hat{f}_h(x) f(x) dx + \int f^2(x) dx$$
Since the last term doesn't depend on $h$, we minimize:
$$\text{CV}(h) = \int \hat{f}_h^2(x)\, dx - \frac{2}{n} \sum_{i=1}^{n} \hat{f}_{h,-i}(X_i)$$
where $\hat{f}_{h,-i}$ is the leave-one-out KDE excluding observation $i$.
The LSCV criterion:
$$\text{LSCV}(h) = \frac{1}{n^2 h} \sum_{i=1}^{n} \sum_{j=1}^{n} \bar{K}\left(\frac{X_i - X_j}{h}\right) - \frac{2}{n(n-1)h} \sum_{i=1}^{n} \sum_{j \neq i} K\left(\frac{X_i - X_j}{h}\right)$$
where $\bar{K}(u) = K * K(u) = \int K(t)K(u-t)dt$ is the convolution of $K$ with itself.
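For the Gaussian kernel this convolution has a simple closed form (it is the $N(0, 2)$ density), so the first term is easy to compute:

$$\bar{K}(u) = (K * K)(u) = \frac{1}{2\sqrt{\pi}} \exp\!\left(-\frac{u^2}{4}\right)$$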
4.2 Biased Cross-Validation (BCV)
BCV estimates the AMISE rather than the ISE, replacing the unknown $R(f'')$ with a kernel estimate:
$$\text{BCV}(h) = \frac{R(K)}{nh} + \frac{h^4 \mu_2(K)^2}{4} \hat{R}(f'')$$
where $\hat{R}(f'')$ is estimated using a kernel derivative estimator.
BCV is essentially a plug-in method reformulated as a CV criterion.
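A minimal sketch of the BCV criterion for the Gaussian kernel, using the leave-out-diagonal double-sum estimate of $R(f'')$ evaluated at the same bandwidth $h$ (one common choice; published formulations differ in the exact estimator, and the function name is ours):

```python
import numpy as np
from scipy.stats import norm

def bcv_criterion(data, h):
    """Biased cross-validation score for bandwidth h (Gaussian-kernel sketch)."""
    x = np.asarray(data, dtype=float)
    n = len(x)
    R_K = 1 / (2 * np.sqrt(np.pi))               # roughness of the Gaussian kernel
    mu2 = 1.0                                     # its second moment

    u = (x[:, None] - x[None, :]) / h
    phi4 = norm.pdf(u) * (u**4 - 6 * u**2 + 3)    # 4th derivative of phi
    np.fill_diagonal(phi4, 0.0)                   # drop i == j terms
    R_f2_hat = phi4.sum() / (n * (n - 1) * h**5)

    return R_K / (n * h) + (h**4 * mu2**2 / 4) * R_f2_hat
```

Minimizing this score over a sensible range of $h$ values (for example with scipy.optimize.minimize_scalar) gives the BCV bandwidth.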
4.3 Comparison of CV Methods:
| Method | Targets | Variability | Bias |
|---|---|---|---|
| LSCV | ISE | High | Unbiased |
| BCV | AMISE | Lower | Slightly biased |
LSCV pros and cons:
- ✓ Unbiased estimation of ISE
- ✓ Works well for rough, multimodal densities
- ✓ No parametric assumptions
- ✗ High variability in the selected $h$
- ✗ Can severely undersmooth (select $h$ too small)
- ✗ The CV criterion often has multiple local minima
LSCV is known to have high variance in bandwidth selection. The CV criterion is often quite flat near the optimum, and different samples from the same distribution can yield very different selected bandwidths. For this reason, plug-in methods (especially Sheather-Jones) are often preferred for unimodal or mildly multimodal densities, while LSCV may be preferred for heavily multimodal densities where plug-in methods' normal reference assumptions fail badly.
```python
import numpy as np
from scipy import optimize
from scipy.stats import norm


def lscv_bandwidth(data, h_range=None):
    """
    Least-squares cross-validation bandwidth selector (Gaussian kernel).

    Minimizes an unbiased estimate of the integrated squared error.

    Parameters
    ----------
    data : array-like
        Sample data.
    h_range : tuple, optional
        (h_min, h_max) range to search over.

    Returns
    -------
    float
        LSCV-optimal bandwidth.
    """
    data = np.asarray(data, dtype=float)
    n = len(data)
    std = np.std(data, ddof=1)

    # Default search range based on Silverman's rule
    h_silverman = 1.06 * std * n**(-0.2)
    if h_range is None:
        h_range = (h_silverman / 10, h_silverman * 5)

    # Pairwise differences are the same for every h, so compute them once
    diffs = data[:, None] - data[None, :]

    def lscv_criterion(h):
        """Compute the LSCV criterion for bandwidth h."""
        u = diffs / h

        # Term 1: integral of f_hat^2, via the convolution formula.
        # K_bar(u) for the Gaussian kernel is the N(0, 2) density.
        term1 = norm.pdf(u, scale=np.sqrt(2)).sum() / (n**2 * h)

        # Term 2: leave-one-out sum (exclude the i == j terms)
        K = norm.pdf(u)
        np.fill_diagonal(K, 0.0)
        term2 = 2 * K.sum() / (n * (n - 1) * h)

        return term1 - term2

    # Minimize LSCV over h_range
    result = optimize.minimize_scalar(
        lscv_criterion, bounds=h_range, method='bounded'
    )
    return result.x


# Example: compare LSCV with Silverman for different distributions
np.random.seed(42)

# Test on unimodal data
unimodal = np.random.normal(0, 1, 300)
h_lscv_uni = lscv_bandwidth(unimodal)
h_silv_uni = 1.06 * np.std(unimodal) * len(unimodal)**(-0.2)
print(f"Unimodal: LSCV = {h_lscv_uni:.4f}, Silverman = {h_silv_uni:.4f}")

# Test on bimodal data (LSCV should find a smaller h)
bimodal = np.concatenate([
    np.random.normal(-2, 0.5, 150),
    np.random.normal(2, 0.5, 150)
])
h_lscv_bi = lscv_bandwidth(bimodal)
h_silv_bi = 1.06 * np.std(bimodal) * len(bimodal)**(-0.2)
print(f"Bimodal: LSCV = {h_lscv_bi:.4f}, Silverman = {h_silv_bi:.4f}")

# Test on strongly multimodal (claw-like) data
claw = np.concatenate([
    np.random.normal(0, 1, 200),
    np.random.normal(-1, 0.1, 50),
    np.random.normal(-0.5, 0.1, 50),
    np.random.normal(0, 0.1, 50),
    np.random.normal(0.5, 0.1, 50),
    np.random.normal(1, 0.1, 50)
])
h_lscv_claw = lscv_bandwidth(claw)
h_silv_claw = 1.06 * np.std(claw) * len(claw)**(-0.2)
print(f"Claw-like: LSCV = {h_lscv_claw:.4f}, Silverman = {h_silv_claw:.4f}")
```

An alternative to squared error-based CV is likelihood-based cross-validation, which maximizes a cross-validated log-likelihood.
Maximum Likelihood CV (MLCV):
The log-likelihood of the KDE is: $$\ell(h) = \sum_{i=1}^{n} \log \hat{f}_h(X_i)$$
But this is cheating—we're evaluating the density at the points used to construct it. Instead, use leave-one-out:
$$\text{MLCV}(h) = \sum_{i=1}^{n} \log \hat{f}_{h,-i}(X_i)$$
where $\hat{f}_{h,-i}$ is the KDE leaving out observation $i$.
Interpretation:
MLCV chooses the bandwidth that best predicts each observation from its neighbors. This is related to minimizing the Kullback-Leibler divergence from the true density to the estimate.
Comparison with LSCV:
| Aspect | LSCV | MLCV |
|---|---|---|
| Targets | ISE / L² error | KL divergence |
| Behavior near zero | Penalizes zeros moderately | Penalizes zeros severely |
| For tail estimation | More stable | Highly variable |
| Common use | Preferred for visualization | Preferred for probabilistic models |
MLCV has a severe problem when $\hat{f}_{h,-i}(X_i) = 0$ for any observation (log of zero is undefined). This typically happens with small bandwidths and compact kernels, or in low-density regions. The Gaussian kernel avoids this issue since it's always positive, but MLCV can still be unstable when the estimate is very small at some observations.
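A minimal sketch of the MLCV score with a Gaussian kernel (always positive, per the note above); a small floor guards the logarithm, and the function name is ours:

```python
import numpy as np
from scipy import optimize
from scipy.stats import norm

def mlcv_score(data, h):
    """Leave-one-out cross-validated log-likelihood for bandwidth h."""
    x = np.asarray(data, dtype=float)
    n = len(x)
    K = norm.pdf((x[:, None] - x[None, :]) / h)
    np.fill_diagonal(K, 0.0)                   # leave observation i out
    f_loo = K.sum(axis=1) / ((n - 1) * h)      # leave-one-out density at X_i
    return np.log(np.maximum(f_loo, 1e-300)).sum()

# Pick h by maximizing the leave-one-out log-likelihood over a plausible range
rng = np.random.default_rng(0)
sample = rng.normal(size=200)
res = optimize.minimize_scalar(lambda h: -mlcv_score(sample, h),
                               bounds=(0.05, 2.0), method='bounded')
print(f"MLCV bandwidth: {res.x:.4f}")
```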
With multiple bandwidth selection methods available, practitioners need guidance on when to use each approach. The choice depends on the data characteristics, computational constraints, and the specific application.
Summary of Methods:
| Method | Speed | Multimodal | Small n | Recommended Use |
|---|---|---|---|---|
| Silverman's rule | Fastest | Poor | OK | Quick initial estimate; unimodal data |
| Silverman (robust) | Fastest | Fair | Good | Outliers present; quick check |
| Two-stage plug-in | Fast | Good | Good | General purpose; automated pipelines |
| Sheather-Jones | Medium | Very good | Good | Default recommendation; most situations |
| LSCV | Slow | Excellent | High variance | Strongly multimodal; research applications |
| MLCV | Slow | Good | Unstable | Probabilistic applications |
1. Start with Silverman's rule to get a quick baseline.
2. Apply Sheather-Jones for a more refined estimate.
3. If SJ and Silverman differ substantially, the density is likely non-normal—consider LSCV or visualize with multiple bandwidths.
4. Always plot the KDE with your selected bandwidth and with the bandwidth varied by ±50% to check the robustness of your conclusions (see the sketch below).
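A sketch of the Step 4 robustness check, assuming matplotlib is available; the data and the value of h are purely illustrative (in practice h would come from your selector):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

rng = np.random.default_rng(1)
data = np.concatenate([rng.normal(-2, 0.5, 250), rng.normal(2, 0.5, 250)])
h = 0.35   # pretend this came from Sheather-Jones

grid = np.linspace(data.min() - 1, data.max() + 1, 400)
for scale, style in [(0.5, '--'), (1.0, '-'), (1.5, ':')]:
    # gaussian_kde takes a multiplicative factor relative to the data std dev
    kde = gaussian_kde(data, bw_method=(scale * h) / data.std(ddof=1))
    plt.plot(grid, kde(grid), style, label=f"{scale:.1f} x h")
plt.legend()
plt.title("KDE sensitivity to the bandwidth (±50%)")
plt.show()
```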
Different statistical packages implement different default bandwidth selectors. Knowing what your software uses by default is essential for reproducibility.
R:

- density() default: bw.nrd0 (Silverman's rule)
- Alternatives: bw.nrd, bw.ucv (LSCV), bw.bcv (BCV), bw.SJ (Sheather-Jones)
- Specify bw.SJ explicitly for best results

Python (scipy):

- scipy.stats.gaussian_kde default: Scott's rule (Silverman via bw_method='silverman')
- No Sheather-Jones option; use the KDEpy package for that

Python (statsmodels):

- statsmodels.nonparametric.kde.KDEUnivariate supports 'scott', 'silverman', and 'normal_reference' rules, or a user-supplied bandwidth

Python (KDEpy):

- Provides the Improved Sheather-Jones selector via bw='ISJ', alongside Silverman's rule

MATLAB:

- ksdensity default: a normal-reference rule of thumb; a custom bandwidth can be passed via the 'Bandwidth' option
```python
import numpy as np

# Generate example data
np.random.seed(42)
data = np.concatenate([
    np.random.normal(-1, 0.5, 300),
    np.random.normal(1.5, 0.8, 200)
])

# Method 1: scipy (Scott's rule by default)
from scipy.stats import gaussian_kde

kde_scipy = gaussian_kde(data)  # Uses Scott's rule
print(f"scipy (Scott): h = {kde_scipy.factor * np.std(data):.4f}")

# Using Silverman in scipy
kde_silverman = gaussian_kde(data, bw_method='silverman')
print(f"scipy (Silverman): h = {kde_silverman.factor * np.std(data):.4f}")

# Method 2: statsmodels (more options)
from statsmodels.nonparametric.kde import KDEUnivariate

kde_statsmodels = KDEUnivariate(data)
kde_statsmodels.fit(bw='silverman')  # Can also use 'scott', 'normal_reference'
print(f"statsmodels (Silverman): h = {kde_statsmodels.bw:.4f}")

# Method 3: KDEpy (recommended for Sheather-Jones)
# pip install KDEpy
try:
    from KDEpy import FFTKDE

    # ISJ (Improved Sheather-Jones) - often the best choice
    kde_isj = FFTKDE(bw='ISJ').fit(data)
    x_grid, density = kde_isj.evaluate()
    print(f"KDEpy (ISJ): h = {kde_isj.bw:.4f}")

    # Silverman for comparison
    kde_silv = FFTKDE(bw='silverman').fit(data)
    print(f"KDEpy (Silverman): h = {kde_silv.bw:.4f}")
except ImportError:
    print("Install KDEpy for ISJ: pip install KDEpy")

# Method 4: Manual Silverman calculation
def silverman_bandwidth(x):
    n = len(x)
    sigma = np.std(x, ddof=1)
    iqr = np.percentile(x, 75) - np.percentile(x, 25)
    # Robust version
    A = min(sigma, iqr / 1.34)
    return 0.9 * A * n**(-0.2)

print(f"Manual (Silverman robust): h = {silverman_bandwidth(data):.4f}")
```

We've covered the major approaches to bandwidth selection in kernel density estimation—from simple rules of thumb to sophisticated data-driven methods.
You now have a complete toolkit for bandwidth selection in univariate KDE. The next section explores what happens when we move to multiple dimensions—the notorious 'curse of dimensionality' that fundamentally limits the applicability of kernel density estimation in high-dimensional spaces.