In the previous section, we established that the optimal bandwidth depends on $R(f'') = \int (f''(x))^2 dx$—a quantity that requires knowing the very density we're trying to estimate. This creates a fundamental circularity at the heart of kernel density estimation.
How do we break this circularity? Several ingenious approaches have been developed over decades of research. Each makes different assumptions and tradeoffs, and understanding them is essential for practitioners who want to move beyond default settings.
This page provides a comprehensive treatment of bandwidth selection methods, from simple rules of thumb to sophisticated data-driven techniques. We'll examine both the theory and practical implementation of each approach, giving you the tools to make informed choices for your specific applications.
By the end of this page, you will understand the reference rules of Silverman and Scott, plug-in methods that estimate $R(f'')$ from data, least-squares cross-validation (LSCV), biased cross-validation (BCV), the Sheather-Jones method (often considered the gold standard), and how to choose among these methods for different applications.
The simplest approach to bandwidth selection is to assume the underlying density belongs to a known family (typically Gaussian) and derive the optimal bandwidth for that reference distribution. These "rules of thumb" require only basic sample statistics and are computationally trivial.
1.1 Silverman's Rule of Thumb
Silverman (1986) derived what has become the most widely used reference rule. Assuming the true density is Gaussian and using the Gaussian kernel:
$$h_{\text{Silverman}} = \left(\frac{4\hat{\sigma}^5}{3n}\right)^{1/5} \approx 1.06 \cdot \hat{\sigma} \cdot n^{-1/5}$$
where $\hat{\sigma}$ is the sample standard deviation.
Derivation:
For a $N(\mu, \sigma^2)$ density: $$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$
$$f''(x) = \frac{1}{\sigma^3\sqrt{2\pi}} \left(\frac{(x-\mu)^2}{\sigma^2} - 1\right) \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$
Computing $R(f'') = \int (f''(x))^2 dx$ yields: $$R(f'') = \frac{3}{8\sigma^5\sqrt{\pi}}$$
Substituting into the AMISE-optimal bandwidth formula with the Gaussian kernel gives the result.
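Concretely, plugging the Gaussian-kernel constants $R(K) = 1/(2\sqrt{\pi})$ and $\mu_2(K) = 1$, together with $R(f'') = \frac{3}{8\sigma^5\sqrt{\pi}}$, into $h = \left(R(K) / (n\,\mu_2(K)^2 R(f''))\right)^{1/5}$ gives

$$h = \left(\frac{1/(2\sqrt{\pi})}{n \cdot \frac{3}{8\sigma^5\sqrt{\pi}}}\right)^{1/5} = \left(\frac{4\sigma^5}{3n}\right)^{1/5} \approx 1.06\,\sigma\, n^{-1/5}.$$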
Silverman also proposed a robust modification that uses the minimum of the standard deviation and a scaled interquartile range: $h = 0.9 \cdot \min\left(\hat{\sigma}, \frac{\text{IQR}}{1.34}\right) \cdot n^{-1/5}$. The factor 1.34 is the IQR of the standard normal. This provides protection against outliers and heavy tails.
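As a quick illustration, here is a minimal sketch of both forms of Silverman's rule (the function name silverman_rule is ours, not a library API):

```python
import numpy as np

def silverman_rule(x, robust=True):
    """Silverman's rule of thumb; the robust form uses min(std, IQR/1.34)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    sigma = np.std(x, ddof=1)
    if robust:
        iqr = np.percentile(x, 75) - np.percentile(x, 25)
        return 0.9 * min(sigma, iqr / 1.34) * n ** (-1 / 5)
    return 1.06 * sigma * n ** (-1 / 5)

rng = np.random.default_rng(0)
sample = rng.standard_t(df=3, size=1000)   # heavy-tailed example
print(f"standard: {silverman_rule(sample, robust=False):.4f}")
print(f"robust:   {silverman_rule(sample):.4f}")
```

On heavy-tailed data like this, the robust version typically returns a noticeably smaller bandwidth because the IQR is inflated by extreme values far less than the standard deviation is.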
1.2 Scott's Rule
The rule often quoted as Scott's rule,
$$h = 3.49 \cdot \hat{\sigma} \cdot n^{-1/3},$$
is actually Scott's bin-width rule for histograms (hence the $n^{-1/3}$ rate). For KDE, Scott (1992) proposed a normal-reference bandwidth that differs from Silverman's only in its constant; as implemented in scipy it is
$$h_{\text{Scott}} = \hat{\sigma} \cdot n^{-1/5}.$$
For KDE specifically, either normal-reference constant is reasonable, and the Silverman form is the more common choice.
1.3 Comparison of Reference Rules:
| Rule | Formula | $n^{-1/5}$ coefficient (using $\sigma$) |
|---|---|---|
| Silverman (standard) | $1.06 \sigma n^{-1/5}$ | 1.06 |
| Silverman (robust) | $0.9 \min(\sigma, \text{IQR}/1.34) n^{-1/5}$ | 0.9 |
| Scott (KDE, as in scipy) | $\sigma n^{-1/5}$ | 1.00 |
| Normal reference (exact) | $1.059 \sigma n^{-1/5}$ | 1.059 |
When do reference rules work well?
Reference rules work well when the true density is close to Gaussian: symmetric, unimodal, and without heavy tails. They tend to oversmooth multimodal or strongly skewed densities. The Gaussian density has nearly the smallest curvature functional $R(f'')$ among densities with a given variance, so any departure from Gaussianity typically means the truly optimal bandwidth is smaller than the reference rule suggests. For bimodal distributions the oversmoothing can be severe—merging two distinct modes into one broad hump.
Plug-in methods directly address the circularity problem by estimating the unknown functional $R(f'')$ from the data, then 'plugging in' this estimate to the optimal bandwidth formula.
The General Strategy:
But wait—there's more circularity!
To estimate $R(f'')$, we need to estimate $f''$. The natural approach is to use a KDE of a derivative, which itself requires a bandwidth. This creates a cascade of bandwidths, each requiring estimation of higher-order functionals.
2.1 Estimating Density Derivatives
The $r$-th derivative of a KDE can be estimated as:
$$\hat{f}^{(r)}_g(x) = \frac{1}{n g^{r+1}} \sum_{i=1}^{n} K^{(r)}\left(\frac{x - X_i}{g}\right)$$
where $K^{(r)}$ is the $r$-th derivative of the kernel and $g$ is a 'pilot' bandwidth.
For estimating $R(f'') = \int (f''(x))^2 dx$, we need the second derivative:
$$\hat{R}(f'') = \int (\hat{f}''_g(x))^2 dx$$
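A minimal sketch of these two estimators for the Gaussian kernel, where $K''(u) = \phi(u)(u^2 - 1)$; the function names are illustrative and the integral is approximated numerically on a grid:

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import trapezoid

def kde_second_derivative(x_grid, data, g):
    """Estimate f''(x) on a grid with a Gaussian kernel and pilot bandwidth g."""
    data = np.asarray(data, dtype=float)
    u = (np.asarray(x_grid)[:, None] - data[None, :]) / g
    K2 = norm.pdf(u) * (u**2 - 1)          # K''(u) for the Gaussian kernel
    return K2.sum(axis=1) / (len(data) * g**3)

def roughness_f2(data, g, grid_size=512):
    """Approximate R(f'') = int (f''(x))^2 dx by trapezoidal integration."""
    data = np.asarray(data, dtype=float)
    xs = np.linspace(data.min() - 4 * g, data.max() + 4 * g, grid_size)
    f2 = kde_second_derivative(xs, data, g)
    return trapezoid(f2**2, xs)

rng = np.random.default_rng(0)
sample = rng.normal(size=400)
print(f"R(f'') estimate: {roughness_f2(sample, g=0.5):.4f}")  # true value is about 0.212 for N(0,1)
```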
2.2 The Bandwidth Cascade
The optimal pilot bandwidth $g$ for estimating $R(f'')$ itself depends on the next higher-order functional $R(f''')$—the roughness of the third derivative. Estimating $R(f''')$ requires another pilot bandwidth, which depends on $R(f^{(4)})$, and so on.
Breaking the cascade:
Two-stage plug-in: Use the normal reference to estimate the highest-order functional needed, then work backwards.
Sheather-Jones (solve-the-equation): Formulate the problem as a fixed-point equation and solve iteratively.
2.3 Two-Stage Plug-in (Park and Marron)
A practical two-stage procedure:
Stage 1: Use the normal reference to estimate $R(f''')$, the functional needed by the pilot: $$\hat{R}(f''') = \frac{15}{16\sqrt{\pi}\hat{\sigma}^7}$$
Stage 2: Use this to compute the optimal pilot bandwidth for estimating $R(f'')$ (for the Gaussian kernel, $K^{(4)}(0) = 3\phi(0) > 0$): $$g = \left(\frac{2K^{(4)}(0)}{\mu_2(K)\, \hat{R}(f''')\, n}\right)^{1/7}$$
Stage 3: Estimate $R(f'')$ using pilot bandwidth $g$: $$\hat{R}(f'') = \frac{1}{n^2 g^5} \sum_{i=1}^{n} \sum_{j=1}^{n} K^{(4)}\left(\frac{X_i - X_j}{g}\right)$$
Stage 4: Compute final bandwidth: $$h = \left(\frac{R(K)}{n \mu_2(K)^2 \hat{R}(f'')}\right)^{1/5}$$
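A compact sketch of these four stages for the Gaussian kernel (the function name two_stage_plugin_bandwidth is ours; this is an illustration of the recipe above, not a reference implementation):

```python
import numpy as np
from scipy.stats import norm

def two_stage_plugin_bandwidth(data):
    """Two-stage direct plug-in bandwidth for a Gaussian-kernel KDE (sketch)."""
    x = np.asarray(data, dtype=float)
    n = len(x)
    sigma = np.std(x, ddof=1)

    # Gaussian-kernel constants
    R_K = 1 / (2 * np.sqrt(np.pi))
    mu2 = 1.0

    def phi4(u):
        # 4th derivative of the standard normal density
        return norm.pdf(u) * (u**4 - 6 * u**2 + 3)

    # Stage 1: normal reference for R(f''')
    R_f3 = 15 / (16 * np.sqrt(np.pi) * sigma**7)

    # Stage 2: pilot bandwidth for estimating R(f''); phi4(0) = K^(4)(0)
    g = (2 * phi4(0.0) / (mu2 * R_f3 * n)) ** (1 / 7)

    # Stage 3: double-sum estimate of R(f'') with pilot g
    diffs = (x[:, None] - x[None, :]) / g
    R_f2_hat = phi4(diffs).sum() / (n**2 * g**5)

    # Stage 4: plug into the AMISE-optimal bandwidth formula
    return (R_K / (n * mu2**2 * R_f2_hat)) ** (1 / 5)

rng = np.random.default_rng(0)
print(f"Two-stage plug-in bandwidth: {two_stage_plugin_bandwidth(rng.normal(size=500)):.4f}")
```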
Two-stage plug-in estimators achieve an asymptotic relative efficiency (ARE) that converges to 1—they achieve the optimal bandwidth asymptotically. The rate of convergence of $\hat{h}$ to $h_{\text{opt}}$ is $n^{-5/14}$ for well-designed two-stage methods. This is fast enough that the excess MISE from suboptimal bandwidth selection is asymptotically negligible.
The Sheather-Jones (SJ) method, published in 1991, is widely considered the gold standard for automatic bandwidth selection. It uses a 'solve-the-equation' approach that elegantly handles the cascade problem.
The Key Insight:
Instead of using a fixed pilot bandwidth, SJ recognizes that the optimal pilot bandwidth depends on the optimal final bandwidth (and vice versa). This leads to a fixed-point equation that can be solved iteratively.
Mathematical Formulation:
Define the AMISE-optimal bandwidth as a function of $R(f'')$: $$h^* = \left(\frac{R(K)}{n \mu_2(K)^2 R(f'')}\right)^{1/5}$$
The estimated $R(f'')$ depends on a pilot bandwidth $g$, which itself is a function of $h$ (through the optimal relationship for derivative estimation): $$g = c_1 \cdot h^{5/7}$$
Combining these gives an implicit equation for $h$: $$h = \left(\frac{R(K)}{n \mu_2(K)^2 \hat{R}(f''; c_1 h^{5/7})}\right)^{1/5}$$
This can be written as a fixed-point equation: $h = T(h)$, where $T$ encapsulates the right-hand side.
```python
import numpy as np
from scipy import optimize
from scipy.stats import norm


def sheather_jones_bandwidth(data):
    """
    Sheather-Jones 'solve-the-equation' bandwidth selector for a
    Gaussian-kernel KDE.

    Parameters
    ----------
    data : array-like
        Sample data for which to compute the optimal bandwidth.

    Returns
    -------
    float
        Bandwidth solving the Sheather-Jones fixed-point equation.
    """
    data = np.asarray(data, dtype=float)
    n = len(data)

    # Standardize the data for numerical stability
    std = np.std(data, ddof=1)
    data_std = (data - np.mean(data)) / std

    # Constants for the Gaussian kernel
    R_K = 1 / (2 * np.sqrt(np.pi))   # roughness R(K)
    mu_2_K = 1.0                     # second moment mu_2(K)

    def phi4(x):
        """4th derivative of the standard normal density."""
        return norm.pdf(x) * (x**4 - 6 * x**2 + 3)

    def phi6(x):
        """6th derivative of the standard normal density."""
        return norm.pdf(x) * (x**6 - 15 * x**4 + 45 * x**2 - 15)

    def psi_hat(r, g):
        """
        Double-sum estimate of psi_r = int f^(r)(x) f(x) dx with pilot
        bandwidth g.  For r = 4 this is the estimate of R(f'').
        """
        deriv = phi4 if r == 4 else phi6
        diffs = (data_std[:, None] - data_std[None, :]) / g
        return deriv(diffs).sum() / (n**2 * g**(r + 1))

    # Normal-reference pilot bandwidths for psi_4 and psi_6
    # (the standardized data has unit scale)
    a = 1.241 * n**(-1 / 7)
    b = 1.230 * n**(-1 / 9)
    psi4_a = psi_hat(4, a)
    psi6_b = psi_hat(6, b)

    # Pilot bandwidth as a function of h:  g(h) = c1 * h^(5/7).
    # In theory psi_4 > 0 and psi_6 < 0; absolute values guard against
    # noisy estimates turning the ratio negative.
    c1 = (2 * phi4(0.0) * mu_2_K * abs(psi4_a) / (R_K * abs(psi6_b))) ** (1 / 7)

    def sj_equation(h):
        """The fixed-point equation h - T(h); the SJ bandwidth is its root."""
        g = c1 * h**(5 / 7)
        R_f2_hat = abs(psi_hat(4, g))          # estimate of R(f'')
        h_target = (R_K / (n * mu_2_K**2 * R_f2_hat)) ** (1 / 5)
        return h - h_target

    # Initial guess from Silverman's rule (sigma = 1 after standardizing)
    h_init = 1.06 * n**(-1 / 5)

    # Solve the fixed-point equation
    try:
        h_opt = optimize.brentq(sj_equation, h_init / 10, h_init * 10)
    except ValueError:
        # Fall back to Silverman's rule if no root is bracketed
        h_opt = h_init

    # Rescale back to the original data scale
    return h_opt * std


# Example usage
np.random.seed(42)

# Unimodal data
normal_data = np.random.normal(0, 1, 500)
h_sj_normal = sheather_jones_bandwidth(normal_data)
print(f"Normal data - SJ bandwidth: {h_sj_normal:.4f}")
print(f"Normal data - Silverman bandwidth: "
      f"{1.06 * np.std(normal_data) * len(normal_data)**(-0.2):.4f}")

# Bimodal data (where SJ excels over the rule of thumb)
bimodal_data = np.concatenate([
    np.random.normal(-2, 0.5, 250),
    np.random.normal(2, 0.5, 250)
])
h_sj_bimodal = sheather_jones_bandwidth(bimodal_data)
print(f"\nBimodal data - SJ bandwidth: {h_sj_bimodal:.4f}")
print(f"Bimodal data - Silverman bandwidth: "
      f"{1.06 * np.std(bimodal_data) * len(bimodal_data)**(-0.2):.4f}")
```

The Sheather-Jones method is available in most statistical software. In R, use bw.SJ(). In Python, scipy.stats.gaussian_kde offers only Scott's and Silverman's rules, so use the KDEpy package (its 'ISJ' selector) or an implementation like the one above. In MATLAB, ksdensity accepts a custom bandwidth through its 'Bandwidth' option. SJ is generally preferred over simple rules of thumb whenever multimodality is suspected.
Cross-validation (CV) approaches select the bandwidth that minimizes some estimated prediction error. Unlike plug-in methods that estimate theoretical quantities, CV methods directly use the data to evaluate bandwidth quality.
4.1 Least-Squares Cross-Validation (LSCV)
LSCV, also known as Unbiased Cross-Validation (UCV), directly targets the ISE (Integrated Squared Error):
$$\text{ISE}(h) = \int (\hat{f}_h(x) - f(x))^2 dx = \int \hat{f}_h^2(x) dx - 2\int \hat{f}_h(x) f(x) dx + \int f^2(x) dx$$
Since the last term doesn't depend on $h$, we minimize:
$$\text{CV}(h) = \int \hat{f}_h^2(x)\, dx - \frac{2}{n} \sum_{i=1}^{n} \hat{f}_{h,-i}(X_i)$$
where $\hat{f}_{h,-i}$ is the leave-one-out KDE excluding observation $i$.
The LSCV criterion:
$$\text{LSCV}(h) = \frac{1}{n^2 h} \sum_{i=1}^{n} \sum_{j=1}^{n} \bar{K}\left(\frac{X_i - X_j}{h}\right) - \frac{2}{n(n-1)h} \sum_{i=1}^{n} \sum_{j \neq i} K\left(\frac{X_i - X_j}{h}\right)$$
where $\bar{K}(u) = K * K(u) = \int K(t)K(u-t)dt$ is the convolution of $K$ with itself.
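For the Gaussian kernel this convolution has a simple closed form (it is the $N(0, 2)$ density), so the first term is easy to compute:

$$\bar{K}(u) = (K * K)(u) = \frac{1}{2\sqrt{\pi}} \exp\!\left(-\frac{u^2}{4}\right)$$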
4.2 Biased Cross-Validation (BCV)
BCV estimates the AMISE rather than the ISE, replacing the unknown $R(f'')$ with a kernel estimate:
$$\text{BCV}(h) = \frac{R(K)}{nh} + \frac{h^4 \mu_2(K)^2}{4} \hat{R}(f'')$$
where $\hat{R}(f'')$ is estimated using a kernel derivative estimator.
BCV is essentially a plug-in method reformulated as a CV criterion.
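A minimal sketch of the BCV criterion for the Gaussian kernel, using the leave-out-diagonal double-sum estimate of $R(f'')$ evaluated at the same bandwidth $h$ (one common choice; published formulations differ in the exact estimator, and the function name is ours):

```python
import numpy as np
from scipy.stats import norm

def bcv_criterion(data, h):
    """Biased cross-validation score for bandwidth h (Gaussian-kernel sketch)."""
    x = np.asarray(data, dtype=float)
    n = len(x)
    R_K = 1 / (2 * np.sqrt(np.pi))               # roughness of the Gaussian kernel
    mu2 = 1.0                                     # its second moment

    u = (x[:, None] - x[None, :]) / h
    phi4 = norm.pdf(u) * (u**4 - 6 * u**2 + 3)    # 4th derivative of phi
    np.fill_diagonal(phi4, 0.0)                   # drop i == j terms
    R_f2_hat = phi4.sum() / (n * (n - 1) * h**5)

    return R_K / (n * h) + (h**4 * mu2**2 / 4) * R_f2_hat
```

Minimizing this score over a sensible range of $h$ values (for example with scipy.optimize.minimize_scalar) gives the BCV bandwidth.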
4.3 Comparison of CV Methods:
| Method | Targets | Variability | Bias |
|---|---|---|---|
| LSCV | ISE | High | Unbiased |
| BCV | AMISE | Lower | Slightly biased |
LSCV pros and cons:
- ✓ Unbiased estimation of ISE
- ✓ Works well for rough, multimodal densities
- ✓ No parametric assumptions
- ✗ High variability in the selected $h$
- ✗ Can severely undersmooth (select $h$ too small)
- ✗ The CV criterion often has multiple local minima
LSCV is known to have high variance in bandwidth selection. The CV criterion is often quite flat near the optimum, and different samples from the same distribution can yield very different selected bandwidths. For this reason, plug-in methods (especially Sheather-Jones) are often preferred for unimodal or mildly multimodal densities, while LSCV may be preferred for heavily multimodal densities where plug-in methods' normal reference assumptions fail badly.
```python
import numpy as np
from scipy import optimize
from scipy.stats import norm


def lscv_bandwidth(data, h_range=None):
    """
    Least-squares cross-validation bandwidth selector (Gaussian kernel).

    Minimizes an unbiased estimate of the integrated squared error.

    Parameters
    ----------
    data : array-like
        Sample data.
    h_range : tuple, optional
        (h_min, h_max) range to search over.

    Returns
    -------
    float
        LSCV-optimal bandwidth.
    """
    data = np.asarray(data, dtype=float)
    n = len(data)
    std = np.std(data, ddof=1)

    # Default search range based on Silverman's rule
    h_silverman = 1.06 * std * n**(-0.2)
    if h_range is None:
        h_range = (h_silverman / 10, h_silverman * 5)

    # Pairwise differences are the same for every h, so compute them once
    diffs = data[:, None] - data[None, :]

    def lscv_criterion(h):
        """Compute the LSCV criterion for bandwidth h."""
        u = diffs / h

        # Term 1: integral of f_hat^2, via the convolution formula.
        # K_bar(u) for the Gaussian kernel is the N(0, 2) density.
        term1 = norm.pdf(u, scale=np.sqrt(2)).sum() / (n**2 * h)

        # Term 2: leave-one-out sum (exclude the i == j terms)
        K = norm.pdf(u)
        np.fill_diagonal(K, 0.0)
        term2 = 2 * K.sum() / (n * (n - 1) * h)

        return term1 - term2

    # Minimize LSCV over h_range
    result = optimize.minimize_scalar(
        lscv_criterion, bounds=h_range, method='bounded'
    )
    return result.x


# Example: compare LSCV with Silverman for different distributions
np.random.seed(42)

# Test on unimodal data
unimodal = np.random.normal(0, 1, 300)
h_lscv_uni = lscv_bandwidth(unimodal)
h_silv_uni = 1.06 * np.std(unimodal) * len(unimodal)**(-0.2)
print(f"Unimodal: LSCV = {h_lscv_uni:.4f}, Silverman = {h_silv_uni:.4f}")

# Test on bimodal data (LSCV should find a smaller h)
bimodal = np.concatenate([
    np.random.normal(-2, 0.5, 150),
    np.random.normal(2, 0.5, 150)
])
h_lscv_bi = lscv_bandwidth(bimodal)
h_silv_bi = 1.06 * np.std(bimodal) * len(bimodal)**(-0.2)
print(f"Bimodal: LSCV = {h_lscv_bi:.4f}, Silverman = {h_silv_bi:.4f}")

# Test on strongly multimodal (claw-like) data
claw = np.concatenate([
    np.random.normal(0, 1, 200),
    np.random.normal(-1, 0.1, 50),
    np.random.normal(-0.5, 0.1, 50),
    np.random.normal(0, 0.1, 50),
    np.random.normal(0.5, 0.1, 50),
    np.random.normal(1, 0.1, 50)
])
h_lscv_claw = lscv_bandwidth(claw)
h_silv_claw = 1.06 * np.std(claw) * len(claw)**(-0.2)
print(f"Claw-like: LSCV = {h_lscv_claw:.4f}, Silverman = {h_silv_claw:.4f}")
```

An alternative to squared error-based CV is likelihood-based cross-validation, which maximizes a cross-validated log-likelihood.
Maximum Likelihood CV (MLCV):
The log-likelihood of the KDE is: $$\ell(h) = \sum_{i=1}^{n} \log \hat{f}_h(X_i)$$
But this is cheating—we're evaluating the density at the points used to construct it. Instead, use leave-one-out:
$$\text{MLCV}(h) = \sum_{i=1}^{n} \log \hat{f}_{h,-i}(X_i)$$
where $\hat{f}_{h,-i}$ is the KDE leaving out observation $i$.
Interpretation:
MLCV chooses the bandwidth that best predicts each observation from its neighbors. This is related to minimizing the Kullback-Leibler divergence from the true density to the estimate.
Comparison with LSCV:
| Aspect | LSCV | MLCV |
|---|---|---|
| Targets | ISE / L² error | KL divergence |
| Behavior near zero | Penalizes zeros moderately | Penalizes zeros severely |
| For tail estimation | More stable | Highly variable |
| Common use | Preferred for visualization | Preferred for probabilistic models |
MLCV has a severe problem when $\hat{f}_{h,-i}(X_i) = 0$ for any observation (log of zero is undefined). This typically happens with small bandwidths and compact kernels, or in low-density regions. The Gaussian kernel avoids this issue since it's always positive, but MLCV can still be unstable when the estimate is very small at some observations.
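A minimal sketch of the MLCV score with a Gaussian kernel (always positive, per the note above); a small floor guards the logarithm, and the function name is ours:

```python
import numpy as np
from scipy import optimize
from scipy.stats import norm

def mlcv_score(data, h):
    """Leave-one-out cross-validated log-likelihood for bandwidth h."""
    x = np.asarray(data, dtype=float)
    n = len(x)
    K = norm.pdf((x[:, None] - x[None, :]) / h)
    np.fill_diagonal(K, 0.0)                   # leave observation i out
    f_loo = K.sum(axis=1) / ((n - 1) * h)      # leave-one-out density at X_i
    return np.log(np.maximum(f_loo, 1e-300)).sum()

# Pick h by maximizing the leave-one-out log-likelihood over a plausible range
rng = np.random.default_rng(0)
sample = rng.normal(size=200)
res = optimize.minimize_scalar(lambda h: -mlcv_score(sample, h),
                               bounds=(0.05, 2.0), method='bounded')
print(f"MLCV bandwidth: {res.x:.4f}")
```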
With multiple bandwidth selection methods available, practitioners need guidance on when to use each approach. The choice depends on the data characteristics, computational constraints, and the specific application.
Summary of Methods:
| Method | Speed | Multimodal | Small n | Recommended Use |
|---|---|---|---|---|
| Silverman's rule | Fastest | Poor | OK | Quick initial estimate; unimodal data |
| Silverman (robust) | Fastest | Fair | Good | Outliers present; quick check |
| Two-stage plug-in | Fast | Good | Good | General purpose; automated pipelines |
| Sheather-Jones | Medium | Very good | Good | Default recommendation; most situations |
| LSCV | Slow | Excellent | High variance | Strongly multimodal; research applications |
| MLCV | Slow | Good | Unstable | Probabilistic applications |
1. Start with Silverman's rule to get a quick baseline.
2. Apply Sheather-Jones for a more refined estimate.
3. If SJ and Silverman differ substantially, the density is likely non-normal—consider LSCV or visualize with multiple bandwidths.
4. Always plot the KDE with your selected bandwidth and with the bandwidth varied by ±50% to check the robustness of your conclusions (see the sketch below).
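A sketch of the Step 4 robustness check, assuming matplotlib is available; the data and the value of h are purely illustrative (in practice h would come from your selector):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

rng = np.random.default_rng(1)
data = np.concatenate([rng.normal(-2, 0.5, 250), rng.normal(2, 0.5, 250)])
h = 0.35   # pretend this came from Sheather-Jones

grid = np.linspace(data.min() - 1, data.max() + 1, 400)
for scale, style in [(0.5, '--'), (1.0, '-'), (1.5, ':')]:
    # gaussian_kde takes a multiplicative factor relative to the data std dev
    kde = gaussian_kde(data, bw_method=(scale * h) / data.std(ddof=1))
    plt.plot(grid, kde(grid), style, label=f"{scale:.1f} x h")
plt.legend()
plt.title("KDE sensitivity to the bandwidth (±50%)")
plt.show()
```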
Different statistical packages implement different default bandwidth selectors. Knowing what your software uses by default is essential for reproducibility.
R:

- density() default: bw.nrd0 (Silverman's rule)
- Alternatives: bw.nrd, bw.ucv (LSCV), bw.bcv (BCV), bw.SJ (Sheather-Jones)
- Specify bw.SJ explicitly for best results

Python (scipy):

- scipy.stats.gaussian_kde default: Scott's rule (Silverman via bw_method='silverman')
- No Sheather-Jones option; use the KDEpy package for that

Python (statsmodels):

- statsmodels.nonparametric.kde.KDEUnivariate supports 'scott', 'silverman', and 'normal_reference' rules, or a user-supplied bandwidth

Python (KDEpy):

- Provides the Improved Sheather-Jones selector via bw='ISJ', alongside Silverman's rule

MATLAB:

- ksdensity default: a normal-reference rule of thumb; a custom bandwidth can be passed via the 'Bandwidth' option
```python
import numpy as np

# Generate example data
np.random.seed(42)
data = np.concatenate([
    np.random.normal(-1, 0.5, 300),
    np.random.normal(1.5, 0.8, 200)
])

# Method 1: scipy (Scott's rule by default)
from scipy.stats import gaussian_kde

kde_scipy = gaussian_kde(data)  # Uses Scott's rule
print(f"scipy (Scott): h = {kde_scipy.factor * np.std(data):.4f}")

# Using Silverman in scipy
kde_silverman = gaussian_kde(data, bw_method='silverman')
print(f"scipy (Silverman): h = {kde_silverman.factor * np.std(data):.4f}")

# Method 2: statsmodels (more options)
from statsmodels.nonparametric.kde import KDEUnivariate

kde_statsmodels = KDEUnivariate(data)
kde_statsmodels.fit(bw='silverman')  # Can also use 'scott', 'normal_reference'
print(f"statsmodels (Silverman): h = {kde_statsmodels.bw:.4f}")

# Method 3: KDEpy (recommended for Sheather-Jones)
# pip install KDEpy
try:
    from KDEpy import FFTKDE

    # ISJ (Improved Sheather-Jones) - often the best choice
    kde_isj = FFTKDE(bw='ISJ').fit(data)
    x_grid, density = kde_isj.evaluate()
    print(f"KDEpy (ISJ): h = {kde_isj.bw:.4f}")

    # Silverman for comparison
    kde_silv = FFTKDE(bw='silverman').fit(data)
    print(f"KDEpy (Silverman): h = {kde_silv.bw:.4f}")
except ImportError:
    print("Install KDEpy for ISJ: pip install KDEpy")

# Method 4: Manual Silverman calculation
def silverman_bandwidth(x):
    n = len(x)
    sigma = np.std(x, ddof=1)
    iqr = np.percentile(x, 75) - np.percentile(x, 25)
    # Robust version
    A = min(sigma, iqr / 1.34)
    return 0.9 * A * n**(-0.2)

print(f"Manual (Silverman robust): h = {silverman_bandwidth(data):.4f}")
```

We've covered the major approaches to bandwidth selection in kernel density estimation—from simple rules of thumb to sophisticated data-driven methods.
You now have a complete toolkit for bandwidth selection in univariate KDE. The next section explores what happens when we move to multiple dimensions—the notorious 'curse of dimensionality' that fundamentally limits the applicability of kernel density estimation in high-dimensional spaces.