Throughout this module, we've been using a single global bandwidth $h$ for all data points and all evaluation locations. This global bandwidth is a compromise—optimal on average but potentially far from optimal in specific regions of the density.
The fundamental problem: consider a density with both sharp peaks and broad, smooth regions. A bandwidth small enough to resolve the peaks undersmooths the broad regions, producing spurious bumps, while a bandwidth large enough to smooth the broad regions blurs the peaks. No single global value serves both.
Variable bandwidth methods address this by allowing the amount of smoothing to adapt to local conditions. The bandwidth becomes a function of location or data point, written as $h(x)$ or $h_i$, rather than a fixed constant.
By the end of this page, you will understand:

- the motivation and theoretical justification for variable bandwidth methods,
- the distinction between balloon estimators and sample-smoothing approaches,
- Abramson's square-root law for optimal local bandwidth,
- nearest-neighbor bandwidth selection,
- local likelihood and local polynomial methods,
- practical implementation considerations, and
- the tradeoffs involved in using variable vs. fixed bandwidth.
The case for variable bandwidth stems from three interconnected observations about fixed-bandwidth KDE.
1.1 Bias Varies with Location
Recall from our bias analysis that the local bias is:
$$\text{Bias}(\hat{f}_h(x)) \approx \frac{h^2}{2} f''(x) \mu_2(K)$$
The bias depends on the local curvature $f''(x)$. Where the density is highly curved (peaks, valleys), a smaller bandwidth is needed to keep bias low. Where the density is nearly flat, a larger bandwidth can be used without significant bias penalty, while reducing variance.
1.2 Variance Varies with Density Level
The local variance is:
$$\text{Var}(\hat{f}_h(x)) \approx \frac{f(x) R(K)}{nh}$$
Variance is proportional to the density itself, $f(x)$. In high-density regions, there are many nearby points contributing to the estimate, so we can afford smaller bandwidth. In low-density regions (tails), fewer points are available, and we need larger bandwidth to reduce variance.
High-density regions typically require smaller bandwidth (more data to smooth, want to preserve peaks), while low-density regions require larger bandwidth (less data available, need to borrow strength). This is precisely the opposite of what a fixed bandwidth provides—fixed h gives the same smoothing everywhere regardless of local data availability.
1.3 The Optimal Local Bandwidth
Minimizing pointwise MSE with respect to $h$ gives the locally optimal bandwidth:
$$h_{\text{opt}}(x) = \left(\frac{f(x) R(K)}{n \mu_2(K)^2 (f''(x))^2}\right)^{1/5}$$
This varies with location through both $f(x)$ and $f''(x)$. A truly optimal estimator would use this location-varying bandwidth.
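For a density that is known in closed form, this formula can be evaluated directly, which gives a feel for how much $h_{\text{opt}}(x)$ varies. Below is a minimal sketch for a standard normal target with a Gaussian kernel, for which $R(K) = 1/(2\sqrt{\pi})$ and $\mu_2(K) = 1$; the function name is illustrative:

```python
import numpy as np
from scipy.stats import norm

def h_opt_normal(x, n):
    """Locally optimal bandwidth h_opt(x) from the pointwise-MSE formula,
    for a standard normal target density and a Gaussian kernel."""
    R_K = 1.0 / (2.0 * np.sqrt(np.pi))  # roughness R(K) of the Gaussian kernel
    mu2_K = 1.0                          # second moment mu_2(K) of the Gaussian kernel
    f = norm.pdf(x)                      # f(x) for the standard normal
    f2 = (x**2 - 1.0) * norm.pdf(x)      # f''(x) for the standard normal
    return (f * R_K / (n * mu2_K**2 * f2**2)) ** 0.2

n = 1000
for x in [0.0, 2.0]:
    print(f"x = {x:+.1f}: h_opt = {h_opt_normal(x, n):.3f}")
```

Note that at the inflection points $x = \pm 1$ we have $f''(x) = 0$, so the formula degenerates ($h_{\text{opt}} \to \infty$): one more reason the pointwise rule cannot be used verbatim.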
The challenge: We don't know $f(x)$ or $f''(x)$ ahead of time—these are what we're trying to estimate!
Variable bandwidth methods approximate this ideal with a two-stage strategy: first compute a pilot estimate of the density using a fixed bandwidth, then use the pilot to assign location- or point-specific bandwidths.
Variable bandwidth methods fall into two distinct categories, depending on whether the bandwidth adapts to the evaluation point or to the data point.
2.1 Balloon Estimators (Point Adaptive)
In balloon estimators, the bandwidth is a function of the evaluation point $x$:
$$\hat{f}_B(x) = \frac{1}{n h(x)} \sum_{i=1}^{n} K\left(\frac{x - X_i}{h(x)}\right)$$
The 'balloon' analogy: imagine inflating a balloon at each evaluation point until it captures a fixed number of data points.
Properties:

- The bandwidth adapts to where you evaluate, not to the data points themselves.
- The resulting estimate need not integrate to 1, so it must be renormalized to serve as a proper density.
- $h(x)$ must be computed afresh at every evaluation point.
Common implementation: $k$-nearest neighbor bandwidth, where $h(x)$ is the distance to the $k$-th nearest neighbor from $x$.
2.2 Sample Smoothing Estimators (Data Adaptive)
In sample smoothing, each data point $X_i$ has its own bandwidth $h_i$:
$$\hat{f}_S(x) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{h_i} K\left(\frac{x - X_i}{h_i}\right)$$
Properties:

- Each data point contributes a kernel with its own width, so the bandwidths $h_i$ can be precomputed once.
- The estimate automatically integrates to 1, since it is an average of proper densities.
- This is the form used by Abramson's method below.
Common implementation: Abramson's square-root law (discussed below).
| Aspect | Balloon Estimator | Sample Smoothing |
|---|---|---|
| Bandwidth varies with | Evaluation point $x$ | Data point $X_i$ |
| Integrates to 1? | No (must renormalize) | Yes (automatically) |
| Interpretation | Adaptive neighborhoods | Adaptive contributions |
| Theoretical grounding | Good | Excellent (Abramson) |
| Computation | Must compute $h(x)$ at each evaluation | Precompute $h_i$ for all data |
| Common method | $k$-NN bandwidth | Square-root law |
Sample smoothing estimators are generally preferred in the statistical literature because they automatically integrate to 1 and have cleaner theoretical properties. Balloon estimators are popular in machine learning applications where computational simplicity matters more than precise density estimation. For formal statistical inference, use sample smoothing; for exploratory analysis or anomaly detection, balloon estimators are often adequate.
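Since balloon estimates need not integrate to 1, a common fix is numerical renormalization over the evaluation grid. A minimal sketch, assuming the density carries negligible mass outside the grid (the function name is illustrative):

```python
import numpy as np

def renormalize(density, grid):
    """Rescale a density estimate on a grid so it integrates to 1
    (trapezoidal rule; assumes negligible mass outside the grid)."""
    total = np.sum(0.5 * (density[:-1] + density[1:]) * np.diff(grid))
    return density / total

grid = np.linspace(-5, 5, 1001)
raw = np.exp(-0.5 * grid**2)   # un-normalized Gaussian shape
est = renormalize(raw, grid)

# The renormalized curve now integrates to ~1 over the grid
integral = np.sum(0.5 * (est[:-1] + est[1:]) * np.diff(grid))
print(integral)
```

This is cheap but only as accurate as the grid; with heavy-tailed estimates the grid must extend far enough to capture the tails.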
In 1982, Ian Abramson proved a remarkable result that provides the theoretical foundation for adaptive bandwidth selection. His square-root law shows that the optimal local bandwidth should be inversely proportional to the square root of the density.
Abramson's Theorem:
For densities with bounded, continuous derivatives, the bandwidth:
$$h_i = \frac{h_0}{\sqrt{f(X_i)/g}}$$
achieves a bias whose leading term has the same order in $h_0$ across the whole domain, where:

- $h_0$ is a global bandwidth parameter that sets the overall scale of smoothing,
- $f(X_i)$ is the (unknown) density at data point $X_i$, and
- $g$ is a normalizing constant, conventionally the geometric mean of the $f(X_i)$, which keeps the $h_i$ on the same scale as $h_0$.
The Result:
With this bandwidth choice, the bias becomes: $$\text{Bias}(\hat{f}(x)) = \frac{h_0^2}{2} f''(x) \cdot f(x)^{-1} \cdot g \cdot \mu_2(K) + O(h_0^4)$$
which, remarkably, has the same order in $h_0$ everywhere, independent of the local density level.
Why the square root? Recall that variance scales as $f(x)/h$ and bias² scales as $h^4 (f'')^2$. In high-density regions, we can use smaller $h$ (reducing bias) while maintaining acceptable variance because many points are nearby. The square-root relationship $h \propto f^{-1/2}$ is the mathematically optimal balance that equalizes the bias-variance tradeoff across density levels.
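Substituting Abramson's bandwidth into the pointwise bias expansion from Section 1.1 makes this explicit. With $h(x) = h_0 (f(x)/g)^{-1/2}$, we get $h(x)^2 = h_0^2\, g / f(x)$, and therefore

$$\text{Bias}(\hat{f}(x)) \approx \frac{h(x)^2}{2} f''(x) \mu_2(K) = \frac{h_0^2}{2} f''(x) f(x)^{-1} g\, \mu_2(K),$$

so the $h_0^2$ order is uniform across the domain, with the local curvature entering only through the ratio $f''(x)/f(x)$.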
Practical Implementation:
Since $f(X_i)$ is unknown, we use a pilot estimate:
Algorithm (Abramson Adaptive Bandwidth):
Stage 1 (Pilot): Compute a fixed-bandwidth KDE with bandwidth $h_{\text{pilot}}$: $$\tilde{f}(x) = \frac{1}{n h_{\text{pilot}}} \sum_{i=1}^{n} K\left(\frac{x - X_i}{h_{\text{pilot}}}\right)$$
Compute local factors: For each data point, compute: $$\lambda_i = \left(\frac{\tilde{f}(X_i)}{g}\right)^{-1/2}$$ where $g = \left(\prod_{i=1}^n \tilde{f}(X_i)\right)^{1/n}$
Stage 2 (Adaptive): The adaptive KDE is: $$\hat{f}(x) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{h_0 \lambda_i} K\left(\frac{x - X_i}{h_0 \lambda_i}\right)$$
Choosing $h_0$ and $h_{\text{pilot}}$: the pilot bandwidth can be a simple rule-of-thumb value (Silverman or Scott); the final estimate is not very sensitive to it, because the pilot only sets the relative bandwidth factors. The global bandwidth $h_0$ matters more: a rule-of-thumb value is a reasonable default, and cross-validation applied to the adaptive estimator can refine it.
```python
import numpy as np
from scipy.stats import norm


class AbramsonKDE:
    """
    Adaptive Kernel Density Estimator using Abramson's square-root law.

    The bandwidth at each data point is inversely proportional to the
    square root of a pilot density estimate, which optimally balances
    the bias-variance tradeoff across regions of varying density.
    """

    def __init__(self, pilot_bw='silverman', sensitivity=0.5):
        """
        Parameters:
        -----------
        pilot_bw : str or float
            Method for pilot bandwidth ('silverman', 'scott') or fixed value
        sensitivity : float
            Controls how strongly bandwidth adapts (0.5 = Abramson optimal).
            Higher values -> more adaptation.
        """
        self.pilot_bw = pilot_bw
        self.alpha = sensitivity  # Abramson uses 0.5

    def fit(self, data):
        """Fit the adaptive KDE to data."""
        self.data = np.atleast_1d(data).flatten()
        self.n = len(self.data)

        # Step 1: Compute pilot bandwidth
        std = np.std(self.data, ddof=1)
        iqr = np.percentile(self.data, 75) - np.percentile(self.data, 25)
        if self.pilot_bw == 'silverman':
            self.h_pilot = 0.9 * min(std, iqr / 1.34) * self.n ** (-0.2)
        elif self.pilot_bw == 'scott':
            self.h_pilot = 1.06 * std * self.n ** (-0.2)
        else:
            self.h_pilot = float(self.pilot_bw)

        # Step 2: Compute pilot density at each data point
        pilot_densities = np.zeros(self.n)
        for i, xi in enumerate(self.data):
            u = (xi - self.data) / self.h_pilot
            pilot_densities[i] = np.mean(norm.pdf(u)) / self.h_pilot

        # Step 3: Compute local bandwidth factors (Abramson's law)
        # Avoid log(0) by flooring pilot densities
        pilot_densities = np.maximum(pilot_densities, 1e-10)

        # Geometric mean of pilot densities
        log_g = np.mean(np.log(pilot_densities))
        g = np.exp(log_g)

        # Local bandwidth factors: lambda_i = (f_pilot(X_i) / g)^(-alpha)
        self.lambdas = (pilot_densities / g) ** (-self.alpha)

        # Step 4: Global bandwidth h0 (use pilot bandwidth as baseline)
        # Could also use cross-validation here
        self.h0 = self.h_pilot

        # Effective bandwidths for each data point
        self.h_local = self.h0 * self.lambdas
        return self

    def evaluate(self, x):
        """Evaluate the adaptive KDE at points x."""
        x = np.atleast_1d(x).flatten()
        density = np.zeros(len(x))
        for i, xi in enumerate(x):
            # Each data point contributes with its own bandwidth
            u = (xi - self.data) / self.h_local
            contributions = norm.pdf(u) / self.h_local
            density[i] = np.mean(contributions)
        return density

    def get_bandwidth_summary(self):
        """Return summary statistics of local bandwidths."""
        return {
            'min': np.min(self.h_local),
            'max': np.max(self.h_local),
            'mean': np.mean(self.h_local),
            'median': np.median(self.h_local),
            'ratio_max_min': np.max(self.h_local) / np.min(self.h_local)
        }


# Example: Compare fixed vs adaptive KDE
np.random.seed(42)

# Mixture with varying local density
data = np.concatenate([
    np.random.normal(-3, 0.3, 200),  # Sharp peak
    np.random.normal(0, 1.5, 100),   # Broad region
    np.random.normal(4, 0.5, 150)    # Medium peak
])

# Fixed bandwidth KDE (Silverman)
h_fixed = 0.9 * min(np.std(data),
                    (np.percentile(data, 75) - np.percentile(data, 25)) / 1.34
                    ) * len(data) ** (-0.2)

# Adaptive KDE
adaptive_kde = AbramsonKDE(pilot_bw='silverman').fit(data)
bw_summary = adaptive_kde.get_bandwidth_summary()

print("Bandwidth Summary:")
print(f"  Fixed bandwidth:      {h_fixed:.4f}")
print(f"  Adaptive - min:       {bw_summary['min']:.4f}")
print(f"  Adaptive - max:       {bw_summary['max']:.4f}")
print(f"  Adaptive - median:    {bw_summary['median']:.4f}")
print(f"  Adaptive range ratio: {bw_summary['ratio_max_min']:.2f}x")

# Evaluate on grid
x_grid = np.linspace(-6, 8, 200)
density_adaptive = adaptive_kde.evaluate(x_grid)

# Fixed bandwidth density for comparison
from scipy.stats import gaussian_kde
fixed_kde = gaussian_kde(data, bw_method=h_fixed / np.std(data))
density_fixed = fixed_kde(x_grid)
```

An alternative to the density-based approach of Abramson is to define bandwidth in terms of nearest neighbors. This creates balloon estimators that automatically adapt to local data density.
4.1 The $k$-Nearest Neighbor (k-NN) Estimator
For an evaluation point $x$, let $d_k(x)$ be the distance to the $k$-th nearest neighbor in the sample. The k-NN density estimate is:
$$\hat{f}_{kNN}(x) = \frac{k}{n \cdot V_d \cdot d_k(x)^d}$$
where $V_d$ is the volume of the unit ball in $d$ dimensions.
Intuition: In high-density regions, the $k$-th neighbor is close (small $d_k(x)$), giving high estimated density. In low-density regions, the $k$-th neighbor is far, giving low estimated density.
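A deterministic one-dimensional check of the formula above (in 1D the "unit ball" is the interval $[-1, 1]$, so $V_1 = 2$); the setup is illustrative:

```python
import numpy as np

def knn_density_1d(x, data, k):
    """Raw k-NN density estimate in 1D: f_hat(x) = k / (n * 2 * d_k(x))."""
    d_k = np.sort(np.abs(data - x))[k - 1]  # distance to the k-th nearest neighbor
    return k / (len(data) * 2.0 * d_k)

# Evenly spaced points on [0, 1], so the true density is ~1 in the interior
data = np.linspace(0.0, 1.0, 101)
est = knn_density_1d(0.5, data, k=10)
print(est)  # 10 / (101 * 2 * 0.05), close to 1.0
```

Here the 10th-nearest neighbor of $x = 0.5$ sits at distance $0.05$, so the estimate is $10 / (101 \cdot 2 \cdot 0.05) \approx 0.99$, close to the true uniform density.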
4.2 k-NN as Variable Bandwidth KDE
We can also use the k-NN distance as the bandwidth in a KDE:
$$\hat{f}(x) = \frac{1}{n \cdot d_k(x)} \sum_{i=1}^{n} K\left(\frac{x - X_i}{d_k(x)}\right)$$
This combines the interpretability of KDE with the adaptive properties of k-NN.
4.3 Choosing $k$
The parameter $k$ plays a role analogous to bandwidth: small $k$ gives a jagged, low-bias, high-variance estimate, while large $k$ gives a smooth, higher-bias, lower-variance estimate.
Rules of thumb for $k$: a common default is $k \approx \sqrt{n}$; as with bandwidth, it is wise to try several values and inspect the results.
4.4 Advantages and Disadvantages
Advantages: no pilot estimate is required, the bandwidth adapts automatically to local data density, and spatial trees (e.g., KD-trees) make the neighbor queries fast.
Disadvantages: the estimate does not integrate to 1 without renormalization, the raw k-NN estimator has heavy, slowly decaying tails, and the estimate can be discontinuous where the identity of the $k$-th neighbor changes.
For quick exploratory analysis, k-NN bandwidth is often sufficient and easy to implement. For formal statistical inference or when the density estimate will be used in downstream calculations (e.g., as a likelihood), prefer Abramson's method or other approaches that produce proper densities.
Beyond simple kernel methods, more sophisticated local approaches use polynomial fitting or likelihood methods that automatically adapt to local structure.
5.1 Local Polynomial Density Estimation
Instead of placing a kernel at each data point, fit a local polynomial to the log-density. The local polynomial estimate at $x$ solves:
$$\max_{\beta} \sum_{i=1}^{n} \left[\beta_0 + \beta_1(X_i - x) + \ldots - \exp(\beta_0 + \ldots)\right] K_h(X_i - x)$$
The estimate is $\hat{f}(x) = \exp(\hat{\beta}_0)$.
Advantages:

- Reduced boundary bias compared with standard KDE, since the local polynomial adapts near the edge of the support.
- The fit is to the log-density, so the density estimate $\exp(\hat{\beta}_0)$ is automatically positive.
- Derivatives of the log-density come directly from the fitted coefficients (e.g., $\hat{\beta}_1$ estimates $(\log f)'(x)$).
5.2 Local Likelihood Density Estimation
Local likelihood approaches maximize a locally weighted likelihood:
$$\hat{f}(x) = \arg\max_{\theta} \sum_{i=1}^{n} w_i(x) \cdot \log p(X_i; \theta)$$
where $w_i(x) = K_h(x - X_i)$ are weights based on distance to $x$, and $p(\cdot; \theta)$ is a parametric family (often log-linear or Gaussian).
5.3 Logspline and Polyshrink Estimators
These are spline-based density estimators that adapt locally in an implicit way: logspline models the log-density as a cubic spline and chooses knot locations adaptively from the data, placing more knots where the density has more structure, while polyshrink applies adaptive shrinkage to the fitted coefficients.
Benefits:

- The estimated density is positive everywhere and, after normalization, integrates to 1 by construction.
- Knot selection provides local adaptation without an explicit bandwidth parameter.
5.4 When to Use Local Polynomial Methods?
| Situation | Recommended Approach |
|---|---|
| Simple exploratory analysis | Standard KDE or k-NN |
| Formal inference, publication | Abramson adaptive or local likelihood |
| Boundary issues important | Local polynomial (reduced boundary bias) |
| Very smooth densities | Standard KDE may suffice |
| Complex multimodal structure | Abramson adaptive or local likelihood |
Variable bandwidth estimators have been extensively analyzed theoretically. Here we summarize the key results relevant to practitioners.
6.1 Asymptotic MISE
The Abramson sample-smoothing estimator achieves AMISE:
$$\text{AMISE} = O(n^{-4/5})$$
the same optimal rate as fixed-bandwidth KDE, but with an improved constant. Specifically, the integrated squared bias is reduced because bias is equalized across the domain.
6.2 Bias Reduction
The key theoretical advantage of Abramson's law is bias standardization: the leading bias term has the same order in $h_0$ everywhere, so no region is systematically oversmoothed or undersmoothed relative to the rest. This can dramatically improve performance for densities with varying curvature.
6.3 Adaptation Costs
Not all is free—adaptive methods incur costs:
Pilot estimate variance: Errors in the pilot propagate to bandwidth selection, adding noise.
Two-stage inference: The final estimator's distribution is affected by pilot randomness.
Parameter explosion: Choosing pilot bandwidth, adaptation sensitivity ($\alpha$), and final bandwidth is more complex than just choosing one $h$.
Adaptation helps most when: The target density has a wide range of local densities and curvatures (e.g., a mixture of sharp and broad components). Adaptation helps least when: The density is roughly uniform in structure (e.g., unimodal, symmetric). In this case, the added complexity of adaptation may not be worth the modest improvement, and standard KDE is often adequate.
6.4 Convergence Rate Comparison
| Method | MISE Rate | Remarks |
|---|---|---|
| Fixed bandwidth (optimal $h$) | $O(n^{-4/5})$ | Best possible for second-order kernels |
| Abramson adaptive | $O(n^{-4/5})$ | Same rate, better constant for heterogeneous densities |
| k-NN balloon | $O(n^{-4/(d+4)})$ | Standard rate, but suboptimal constant |
| Local polynomial (order $p$) | $O(n^{-2(p+1)/(2p+d+2)})$ | Can achieve faster rates with higher order |
Key insight: For second-order kernels, you can't beat the $n^{-4/5}$ rate (in 1D) regardless of variable bandwidth. The advantage is in the constant factor and robustness to density heterogeneity, not in the rate itself.
Implementing variable bandwidth estimators requires attention to several practical details.
7.1 Computational Complexity
| Method | Precomputation | Evaluation at $m$ points |
|---|---|---|
| Fixed KDE (naive) | — | $O(nm)$ |
| Fixed KDE (FFT) | $O(n + G \log G)$ | $O(m)$ with interpolation |
| Abramson adaptive | $O(n^2)$ (pilot at data points) | $O(nm)$ (no FFT possible) |
| k-NN adaptive | $O(n \log n)$ (build tree) | $O(km \log n)$ |
Key issue: Variable bandwidth breaks the convolution structure that enables FFT acceleration. For large datasets, this can be a significant bottleneck.
Mitigation strategies:

- Compute the pilot estimate on a binned grid (histogram plus smoothing) rather than by pairwise sums, reducing the $O(n^2)$ pilot stage to roughly $O(n + G^2)$, or $O(n + G \log G)$ with an FFT convolution on the grid.
- Use compactly supported (or truncated) kernels together with spatial trees, so each evaluation touches only nearby points.
- Subsample the data for the pilot stage; the bandwidth factors are relatively insensitive to pilot noise.
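One common mitigation, computing the pilot on a binned grid, can be sketched as follows. This is an illustrative approximation (the function name and grid size are choices, not a standard API): data are snapped to histogram bins, so the pilot costs roughly $O(n + G^2)$ for $G$ grid points instead of $O(n^2)$:

```python
import numpy as np

def binned_pilot_density(data, h_pilot, n_grid=512):
    """Approximate pilot densities f~(X_i) by smoothing a histogram on a
    grid and interpolating back to the data points.  Cost is O(n + G^2)
    for G grid points, versus O(n^2) for exact pairwise sums."""
    lo = data.min() - 3 * h_pilot
    hi = data.max() + 3 * h_pilot
    grid = np.linspace(lo, hi, n_grid)
    counts, edges = np.histogram(data, bins=n_grid, range=(lo, hi))
    centers = 0.5 * (edges[:-1] + edges[1:])

    # Gaussian-kernel smoothing of the bin counts on the grid
    u = (grid[:, None] - centers[None, :]) / h_pilot
    dens_grid = (np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)) @ counts
    dens_grid /= len(data) * h_pilot

    # Interpolate the gridded pilot back to the data points
    return np.interp(data, grid, dens_grid)

rng = np.random.default_rng(0)
data = rng.normal(0.0, 1.0, 2000)
pilot = binned_pilot_density(data, h_pilot=0.3)
print(pilot.min(), pilot.max())
```

For very large $G$, the $G \times G$ kernel matrix can itself be replaced by an FFT convolution of the bin counts, restoring $O(G \log G)$ smoothing; the approximation idea is the same.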
```python
import numpy as np
from scipy.spatial import KDTree


class KNNAdaptiveKDE:
    """
    k-Nearest Neighbor Adaptive Kernel Density Estimator.

    Uses the distance to the k-th nearest neighbor as the local
    bandwidth at each evaluation point.
    """

    def __init__(self, k=None, kernel='gaussian'):
        """
        Parameters:
        -----------
        k : int or None
            Number of neighbors. If None, uses sqrt(n).
        kernel : str
            Kernel type: 'gaussian' or 'epanechnikov'
        """
        self.k = k
        self.kernel = kernel

    def fit(self, data):
        """Build the KD-tree for efficient neighbor queries."""
        self.data = np.atleast_2d(data)
        if self.data.shape[0] == 1:
            self.data = self.data.T
        self.n, self.d = self.data.shape

        # Default k
        if self.k is None:
            self.k = max(1, int(np.sqrt(self.n)))

        # Build KD-tree
        self.tree = KDTree(self.data)
        return self

    def _kernel(self, u):
        """Evaluate kernel at scaled distances u."""
        if self.kernel == 'gaussian':
            return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)
        elif self.kernel == 'epanechnikov':
            return np.where(np.abs(u) <= 1, 0.75 * (1 - u**2), 0)

    def evaluate(self, x):
        """Evaluate adaptive KDE at points x."""
        x = np.atleast_2d(x)
        if x.shape[0] == 1 and self.d > 1:
            x = x.T
        elif x.shape[1] != self.d:
            x = x.reshape(-1, self.d)

        m = x.shape[0]
        density = np.zeros(m)

        for i in range(m):
            # Find k-th nearest neighbor distance
            distances, _ = self.tree.query(x[i], k=self.k + 1)
            h_local = distances[-1]  # k-th neighbor (excluding self)
            if h_local < 1e-10:
                h_local = 1e-10  # Prevent division by zero

            # Evaluate kernel at all data points with this bandwidth
            if self.d == 1:
                u = (x[i] - self.data.flatten()) / h_local
            else:
                u = np.linalg.norm(x[i] - self.data, axis=1) / h_local

            density[i] = np.mean(self._kernel(u)) / h_local**self.d

        return density


def compare_kde_methods(data, x_eval):
    """Compare fixed, Abramson, and k-NN adaptive KDE."""
    from scipy.stats import gaussian_kde

    n = len(data)
    results = {}

    # 1. Fixed bandwidth (Silverman)
    fixed_kde = gaussian_kde(data, bw_method='silverman')
    results['fixed'] = fixed_kde(x_eval)
    results['fixed_bw'] = fixed_kde.factor * np.std(data)

    # 2. Abramson adaptive
    abramson = AbramsonKDE().fit(data)
    results['abramson'] = abramson.evaluate(x_eval)
    results['abramson_bw_range'] = (
        abramson.h_local.min(), abramson.h_local.max()
    )

    # 3. k-NN adaptive
    knn_kde = KNNAdaptiveKDE(k=int(np.sqrt(n))).fit(data.reshape(-1, 1))
    results['knn'] = knn_kde.evaluate(x_eval.reshape(-1, 1))

    return results


# Example comparison
np.random.seed(42)

# Create challenging density: sharp peak + broad region + tail
data = np.concatenate([
    np.random.normal(-2, 0.2, 300),    # Very sharp
    np.random.normal(1, 1.5, 150),     # Very broad
    np.random.exponential(2, 100) + 3  # Long tail
])

x_eval = np.linspace(-4, 15, 300)
results = compare_kde_methods(data, x_eval)

print("Comparison of KDE Methods:")
print(f"Fixed bandwidth: {results['fixed_bw']:.4f}")
print(f"Abramson bandwidth range: {results['abramson_bw_range'][0]:.4f} - "
      f"{results['abramson_bw_range'][1]:.4f}")
print(f"Abramson adapts by factor of "
      f"{results['abramson_bw_range'][1] / results['abramson_bw_range'][0]:.1f}x")
```

7.2 Numerical Stability
Floor densities: When computing $f^{-1/2}$, very small pilot densities cause bandwidth explosion. Apply a floor: $\tilde{f}(x) \leftarrow \max(\tilde{f}(x), \epsilon)$.
Cap bandwidth ratios: Prevent extreme bandwidth variation by capping: $h_i \leftarrow \text{clip}(h_i, h_0/r, h_0 \cdot r)$ for some ratio $r$ (e.g., $r = 5$).
Sensitivity to outliers: Outliers get large bandwidths (low pilot density). This may or may not be desired. Consider robust pilot methods.
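These safeguards combine in a few lines. A minimal sketch, where the floor $\epsilon$ and ratio cap $r$ are tuning choices rather than standard values:

```python
import numpy as np

def stabilized_bandwidths(pilot_densities, h0, eps=1e-10, r=5.0):
    """Abramson-style local bandwidths with a density floor and a cap
    on how far any h_i may drift from the global bandwidth h0."""
    fd = np.maximum(pilot_densities, eps)       # floor tiny pilot densities
    g = np.exp(np.mean(np.log(fd)))             # geometric mean of pilot densities
    h = h0 * (fd / g) ** (-0.5)                 # square-root law h_i = h0 / sqrt(f_i/g)
    return np.clip(h, h0 / r, h0 * r)           # cap bandwidth ratios at r

# A near-zero pilot density would explode without the floor and cap
pilot = np.array([0.4, 0.1, 1e-14, 0.25])
print(stabilized_bandwidths(pilot, h0=0.3))
```

With these inputs, the near-zero pilot density is floored and its bandwidth capped at $h_0 \cdot r$, while the well-populated points are capped at $h_0 / r$.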
7.3 Software Availability
- **R:** the `ks` package provides excellent adaptive KDE via `kde()` with `adaptive=TRUE`.
- **Python:** `KDEpy` has some adaptive features; otherwise implement manually.
- **MATLAB:** the `ksdensity` function has limited adaptive options; manual implementation is often needed.

Variable bandwidth methods address a fundamental limitation of fixed-bandwidth KDE: the inability to optimally smooth regions with different local characteristics. Here are the key takeaways:

- A fixed bandwidth is a compromise; the optimal bandwidth varies with local density and curvature.
- Sample-smoothing (Abramson-style) estimators integrate to 1 and have the stronger theory; balloon (k-NN) estimators are simpler and popular in machine learning.
- Abramson's square-root law $h_i \propto f(X_i)^{-1/2}$ equalizes the order of the bias across the domain.
- Adaptation improves the constant, not the $O(n^{-4/5})$ rate, and costs a pilot stage, extra tuning parameters, and the loss of FFT acceleration.
- In practice, stabilize implementations by flooring pilot densities and capping bandwidth ratios.
You have completed Module 2: KDE Theory. You now have a deep understanding of kernel functions, the bias-variance tradeoff, bandwidth selection methods, the curse of dimensionality, and variable bandwidth techniques. This theoretical foundation prepares you for the next modules on Gaussian Mixture Models and the Expectation-Maximization algorithm, where many of these concepts will reappear in parametric settings.