Throughout this module, we've been using a single global bandwidth $h$ for all data points and all evaluation locations. This global bandwidth is a compromise—optimal on average but potentially far from optimal in specific regions of the density.
The fundamental problem: consider a density with both sharp peaks and broad, smooth regions. A bandwidth small enough to resolve the peaks undersmooths the broad regions, producing spurious bumps, while a bandwidth large enough to smooth the broad regions blurs the peaks. No single global value serves both.
Variable bandwidth methods address this by allowing the amount of smoothing to adapt to local conditions. The bandwidth becomes a function of location or data point, written as $h(x)$ or $h_i$, rather than a fixed constant.
By the end of this page, you will understand:

- the motivation and theoretical justification for variable bandwidth methods,
- the distinction between balloon estimators and sample-smoothing approaches,
- Abramson's square-root law for optimal local bandwidth,
- nearest-neighbor bandwidth selection,
- local likelihood and local polynomial methods,
- practical implementation considerations, and
- the tradeoffs involved in using variable vs. fixed bandwidth.
The case for variable bandwidth stems from three interconnected observations about fixed-bandwidth KDE.
1.1 Bias Varies with Location
Recall from our bias analysis that the local bias is:
$$\text{Bias}(\hat{f}_h(x)) \approx \frac{h^2}{2} f''(x) \mu_2(K)$$
The bias depends on the local curvature $f''(x)$. Where the density is highly curved (peaks, valleys), a smaller bandwidth is needed to keep bias low. Where the density is nearly flat, a larger bandwidth can be used without significant bias penalty, while reducing variance.
1.2 Variance Varies with Density Level
The local variance is:
$$\text{Var}(\hat{f}_h(x)) \approx \frac{f(x) R(K)}{nh}$$
Variance is proportional to the density itself, $f(x)$. In high-density regions, there are many nearby points contributing to the estimate, so we can afford smaller bandwidth. In low-density regions (tails), fewer points are available, and we need larger bandwidth to reduce variance.
High-density regions typically require smaller bandwidth (more data to smooth, want to preserve peaks), while low-density regions require larger bandwidth (less data available, need to borrow strength). This is precisely the opposite of what a fixed bandwidth provides—fixed h gives the same smoothing everywhere regardless of local data availability.
1.3 The Optimal Local Bandwidth
Minimizing pointwise MSE with respect to $h$ gives the locally optimal bandwidth:
$$h_{\text{opt}}(x) = \left(\frac{f(x) R(K)}{n \mu_2(K)^2 (f''(x))^2}\right)^{1/5}$$
This varies with location through both $f(x)$ and $f''(x)$. A truly optimal estimator would use this location-varying bandwidth.
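For a density that is known in closed form, this formula can be evaluated directly, which gives a feel for how much $h_{\text{opt}}(x)$ varies. Below is a minimal sketch for a standard normal target with a Gaussian kernel, for which $R(K) = 1/(2\sqrt{\pi})$ and $\mu_2(K) = 1$; the function name is illustrative:

```python
import numpy as np
from scipy.stats import norm

def h_opt_normal(x, n):
    """Locally optimal bandwidth h_opt(x) from the pointwise-MSE formula,
    for a standard normal target density and a Gaussian kernel."""
    R_K = 1.0 / (2.0 * np.sqrt(np.pi))  # roughness R(K) of the Gaussian kernel
    mu2_K = 1.0                          # second moment mu_2(K) of the Gaussian kernel
    f = norm.pdf(x)                      # f(x) for the standard normal
    f2 = (x**2 - 1.0) * norm.pdf(x)      # f''(x) for the standard normal
    return (f * R_K / (n * mu2_K**2 * f2**2)) ** 0.2

n = 1000
for x in [0.0, 2.0]:
    print(f"x = {x:+.1f}: h_opt = {h_opt_normal(x, n):.3f}")
```

Note that at the inflection points $x = \pm 1$ we have $f''(x) = 0$, so the formula degenerates ($h_{\text{opt}} \to \infty$): one more reason the pointwise rule cannot be used verbatim.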
The challenge: We don't know $f(x)$ or $f''(x)$ ahead of time—these are what we're trying to estimate!
Variable bandwidth methods approximate this ideal with a two-stage strategy: first compute a pilot estimate of the density using a fixed bandwidth, then use the pilot to assign location- or point-specific bandwidths.
Variable bandwidth methods fall into two distinct categories, depending on whether the bandwidth adapts to the evaluation point or to the data point.
2.1 Balloon Estimators (Point Adaptive)
In balloon estimators, the bandwidth is a function of the evaluation point $x$:
$$\hat{f}_B(x) = \frac{1}{n h(x)} \sum_{i=1}^{n} K\left(\frac{x - X_i}{h(x)}\right)$$
The 'balloon' analogy: imagine inflating a balloon at each evaluation point until it captures a fixed number of data points.
Properties:

- The bandwidth adapts to where you evaluate, not to the data points themselves.
- The resulting estimate need not integrate to 1, so it must be renormalized to serve as a proper density.
- $h(x)$ must be computed afresh at every evaluation point.
Common implementation: $k$-nearest neighbor bandwidth, where $h(x)$ is the distance to the $k$-th nearest neighbor from $x$.
2.2 Sample Smoothing Estimators (Data Adaptive)
In sample smoothing, each data point $X_i$ has its own bandwidth $h_i$:
$$\hat{f}_S(x) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{h_i} K\left(\frac{x - X_i}{h_i}\right)$$
Properties:

- Each data point contributes a kernel with its own width, so the bandwidths $h_i$ can be precomputed once.
- The estimate automatically integrates to 1, since it is an average of proper densities.
- This is the form used by Abramson's method below.
Common implementation: Abramson's square-root law (discussed below).
| Aspect | Balloon Estimator | Sample Smoothing |
|---|---|---|
| Bandwidth varies with | Evaluation point $x$ | Data point $X_i$ |
| Integrates to 1? | No (must renormalize) | Yes (automatically) |
| Interpretation | Adaptive neighborhoods | Adaptive contributions |
| Theoretical grounding | Good | Excellent (Abramson) |
| Computation | Must compute $h(x)$ at each evaluation | Precompute $h_i$ for all data |
| Common method | $k$-NN bandwidth | Square-root law |
Sample smoothing estimators are generally preferred in the statistical literature because they automatically integrate to 1 and have cleaner theoretical properties. Balloon estimators are popular in machine learning applications where computational simplicity matters more than precise density estimation. For formal statistical inference, use sample smoothing; for exploratory analysis or anomaly detection, balloon estimators are often adequate.
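Since balloon estimates need not integrate to 1, a common fix is numerical renormalization over the evaluation grid. A minimal sketch, assuming the density carries negligible mass outside the grid (the function name is illustrative):

```python
import numpy as np

def renormalize(density, grid):
    """Rescale a density estimate on a grid so it integrates to 1
    (trapezoidal rule; assumes negligible mass outside the grid)."""
    total = np.sum(0.5 * (density[:-1] + density[1:]) * np.diff(grid))
    return density / total

grid = np.linspace(-5, 5, 1001)
raw = np.exp(-0.5 * grid**2)   # un-normalized Gaussian shape
est = renormalize(raw, grid)

# The renormalized curve now integrates to ~1 over the grid
integral = np.sum(0.5 * (est[:-1] + est[1:]) * np.diff(grid))
print(integral)
```

This is cheap but only as accurate as the grid; with heavy-tailed estimates the grid must extend far enough to capture the tails.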
In 1982, Ian Abramson proved a remarkable result that provides the theoretical foundation for adaptive bandwidth selection. His square-root law shows that the optimal local bandwidth should be inversely proportional to the square root of the density.
Abramson's Theorem:
For densities with bounded, continuous derivatives, the bandwidth:
$$h_i = \frac{h_0}{\sqrt{f(X_i)/g}}$$
achieves a bias whose leading term has the same order in $h_0$ across the whole domain, where:

- $h_0$ is a global bandwidth parameter that sets the overall scale of smoothing,
- $f(X_i)$ is the (unknown) density at data point $X_i$, and
- $g$ is a normalizing constant, conventionally the geometric mean of the $f(X_i)$, which keeps the $h_i$ on the same scale as $h_0$.
The Result:
With this bandwidth choice, the bias becomes: $$\text{Bias}(\hat{f}(x)) = \frac{h_0^2}{2} f''(x) \cdot f(x)^{-1} \cdot g \cdot \mu_2(K) + O(h_0^4)$$
which, remarkably, has the same order in $h_0$ everywhere, independent of the local density level.
Why the square root? Recall that variance scales as $f(x)/h$ and bias² scales as $h^4 (f'')^2$. In high-density regions, we can use smaller $h$ (reducing bias) while maintaining acceptable variance because many points are nearby. The square-root relationship $h \propto f^{-1/2}$ is the mathematically optimal balance that equalizes the bias-variance tradeoff across density levels.
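Substituting Abramson's bandwidth into the pointwise bias expansion from Section 1.1 makes this explicit. With $h(x) = h_0 (f(x)/g)^{-1/2}$, we get $h(x)^2 = h_0^2\, g / f(x)$, and therefore

$$\text{Bias}(\hat{f}(x)) \approx \frac{h(x)^2}{2} f''(x) \mu_2(K) = \frac{h_0^2}{2} f''(x) f(x)^{-1} g\, \mu_2(K),$$

so the $h_0^2$ order is uniform across the domain, with the local curvature entering only through the ratio $f''(x)/f(x)$.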
Practical Implementation:
Since $f(X_i)$ is unknown, we use a pilot estimate:
Algorithm (Abramson Adaptive Bandwidth):
Stage 1 (Pilot): Compute a fixed-bandwidth KDE with bandwidth $h_{\text{pilot}}$: $$\tilde{f}(x) = \frac{1}{n h_{\text{pilot}}} \sum_{i=1}^{n} K\left(\frac{x - X_i}{h_{\text{pilot}}}\right)$$
Compute local factors: For each data point, compute: $$\lambda_i = \left(\frac{\tilde{f}(X_i)}{g}\right)^{-1/2}$$ where $g = \left(\prod_{i=1}^n \tilde{f}(X_i)\right)^{1/n}$
Stage 2 (Adaptive): The adaptive KDE is: $$\hat{f}(x) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{h_0 \lambda_i} K\left(\frac{x - X_i}{h_0 \lambda_i}\right)$$
Choosing $h_0$ and $h_{\text{pilot}}$: the pilot bandwidth can be a simple rule-of-thumb value (Silverman or Scott); the final estimate is not very sensitive to it, because the pilot only sets the relative bandwidth factors. The global bandwidth $h_0$ matters more: a rule-of-thumb value is a reasonable default, and cross-validation applied to the adaptive estimator can refine it.
```python
import numpy as np
from scipy.stats import norm


class AbramsonKDE:
    """
    Adaptive Kernel Density Estimator using Abramson's square-root law.

    The bandwidth at each data point is inversely proportional to the
    square root of a pilot density estimate, which optimally balances
    the bias-variance tradeoff across regions of varying density.
    """

    def __init__(self, pilot_bw='silverman', sensitivity=0.5):
        """
        Parameters:
        -----------
        pilot_bw : str or float
            Method for pilot bandwidth ('silverman', 'scott') or fixed value
        sensitivity : float
            Controls how strongly bandwidth adapts (0.5 = Abramson optimal).
            Higher values -> more adaptation.
        """
        self.pilot_bw = pilot_bw
        self.alpha = sensitivity  # Abramson uses 0.5

    def fit(self, data):
        """Fit the adaptive KDE to data."""
        self.data = np.atleast_1d(data).flatten()
        self.n = len(self.data)

        # Step 1: Compute pilot bandwidth
        std = np.std(self.data, ddof=1)
        iqr = np.percentile(self.data, 75) - np.percentile(self.data, 25)
        if self.pilot_bw == 'silverman':
            self.h_pilot = 0.9 * min(std, iqr / 1.34) * self.n ** (-0.2)
        elif self.pilot_bw == 'scott':
            self.h_pilot = 1.06 * std * self.n ** (-0.2)
        else:
            self.h_pilot = float(self.pilot_bw)

        # Step 2: Compute pilot density at each data point
        pilot_densities = np.zeros(self.n)
        for i, xi in enumerate(self.data):
            u = (xi - self.data) / self.h_pilot
            pilot_densities[i] = np.mean(norm.pdf(u)) / self.h_pilot

        # Step 3: Compute local bandwidth factors (Abramson's law)
        # Avoid log(0) by flooring pilot densities
        pilot_densities = np.maximum(pilot_densities, 1e-10)

        # Geometric mean of pilot densities
        log_g = np.mean(np.log(pilot_densities))
        g = np.exp(log_g)

        # Local bandwidth factors: lambda_i = (f_pilot(X_i) / g)^(-alpha)
        self.lambdas = (pilot_densities / g) ** (-self.alpha)

        # Step 4: Global bandwidth h0 (use pilot bandwidth as baseline)
        # Could also use cross-validation here
        self.h0 = self.h_pilot

        # Effective bandwidths for each data point
        self.h_local = self.h0 * self.lambdas
        return self

    def evaluate(self, x):
        """Evaluate the adaptive KDE at points x."""
        x = np.atleast_1d(x).flatten()
        density = np.zeros(len(x))
        for i, xi in enumerate(x):
            # Each data point contributes with its own bandwidth
            u = (xi - self.data) / self.h_local
            contributions = norm.pdf(u) / self.h_local
            density[i] = np.mean(contributions)
        return density

    def get_bandwidth_summary(self):
        """Return summary statistics of local bandwidths."""
        return {
            'min': np.min(self.h_local),
            'max': np.max(self.h_local),
            'mean': np.mean(self.h_local),
            'median': np.median(self.h_local),
            'ratio_max_min': np.max(self.h_local) / np.min(self.h_local)
        }


# Example: Compare fixed vs adaptive KDE
np.random.seed(42)

# Mixture with varying local density
data = np.concatenate([
    np.random.normal(-3, 0.3, 200),  # Sharp peak
    np.random.normal(0, 1.5, 100),   # Broad region
    np.random.normal(4, 0.5, 150)    # Medium peak
])

# Fixed bandwidth KDE (Silverman)
h_fixed = 0.9 * min(np.std(data),
                    (np.percentile(data, 75) - np.percentile(data, 25)) / 1.34
                    ) * len(data) ** (-0.2)

# Adaptive KDE
adaptive_kde = AbramsonKDE(pilot_bw='silverman').fit(data)
bw_summary = adaptive_kde.get_bandwidth_summary()

print("Bandwidth Summary:")
print(f"  Fixed bandwidth:      {h_fixed:.4f}")
print(f"  Adaptive - min:       {bw_summary['min']:.4f}")
print(f"  Adaptive - max:       {bw_summary['max']:.4f}")
print(f"  Adaptive - median:    {bw_summary['median']:.4f}")
print(f"  Adaptive range ratio: {bw_summary['ratio_max_min']:.2f}x")

# Evaluate on grid
x_grid = np.linspace(-6, 8, 200)
density_adaptive = adaptive_kde.evaluate(x_grid)

# Fixed bandwidth density for comparison
from scipy.stats import gaussian_kde
fixed_kde = gaussian_kde(data, bw_method=h_fixed / np.std(data))
density_fixed = fixed_kde(x_grid)
```

An alternative to the density-based approach of Abramson is to define bandwidth in terms of nearest neighbors. This creates balloon estimators that automatically adapt to local data density.
4.1 The $k$-Nearest Neighbor (k-NN) Estimator
For an evaluation point $x$, let $d_k(x)$ be the distance to the $k$-th nearest neighbor in the sample. The k-NN density estimate is:
$$\hat{f}_{kNN}(x) = \frac{k}{n \cdot V_d \cdot d_k(x)^d}$$
where $V_d$ is the volume of the unit ball in $d$ dimensions.
Intuition: In high-density regions, the $k$-th neighbor is close (small $d_k(x)$), giving high estimated density. In low-density regions, the $k$-th neighbor is far, giving low estimated density.
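A deterministic one-dimensional check of the formula above (in 1D the "unit ball" is the interval $[-1, 1]$, so $V_1 = 2$); the setup is illustrative:

```python
import numpy as np

def knn_density_1d(x, data, k):
    """Raw k-NN density estimate in 1D: f_hat(x) = k / (n * 2 * d_k(x))."""
    d_k = np.sort(np.abs(data - x))[k - 1]  # distance to the k-th nearest neighbor
    return k / (len(data) * 2.0 * d_k)

# Evenly spaced points on [0, 1], so the true density is ~1 in the interior
data = np.linspace(0.0, 1.0, 101)
est = knn_density_1d(0.5, data, k=10)
print(est)  # 10 / (101 * 2 * 0.05), close to 1.0
```

Here the 10th-nearest neighbor of $x = 0.5$ sits at distance $0.05$, so the estimate is $10 / (101 \cdot 2 \cdot 0.05) \approx 0.99$, close to the true uniform density.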
4.2 k-NN as Variable Bandwidth KDE
We can also use the k-NN distance as the bandwidth in a KDE:
$$\hat{f}(x) = \frac{1}{n \cdot d_k(x)} \sum_{i=1}^{n} K\left(\frac{x - X_i}{d_k(x)}\right)$$
This combines the interpretability of KDE with the adaptive properties of k-NN.
4.3 Choosing $k$
The parameter $k$ plays a role analogous to bandwidth: small $k$ gives a jagged, low-bias, high-variance estimate, while large $k$ gives a smooth, higher-bias, lower-variance estimate.
Rules of thumb for $k$: a common default is $k \approx \sqrt{n}$; as with bandwidth, it is wise to try several values and inspect the results.
4.4 Advantages and Disadvantages
Advantages: no pilot estimate is required, the bandwidth adapts automatically to local data density, and spatial trees (e.g., KD-trees) make the neighbor queries fast.
Disadvantages: the estimate does not integrate to 1 without renormalization, the raw k-NN estimator has heavy, slowly decaying tails, and the estimate can be discontinuous where the identity of the $k$-th neighbor changes.
For quick exploratory analysis, k-NN bandwidth is often sufficient and easy to implement. For formal statistical inference or when the density estimate will be used in downstream calculations (e.g., as a likelihood), prefer Abramson's method or other approaches that produce proper densities.
Beyond simple kernel methods, more sophisticated local approaches use polynomial fitting or likelihood methods that automatically adapt to local structure.
5.1 Local Polynomial Density Estimation
Instead of placing a kernel at each data point, fit a local polynomial to the log-density. The local polynomial estimate at $x$ solves:
$$\max_{\beta} \sum_{i=1}^{n} \left[\beta_0 + \beta_1(X_i - x) + \ldots - \exp(\beta_0 + \ldots)\right] K_h(X_i - x)$$
The estimate is $\hat{f}(x) = \exp(\hat{\beta}_0)$.
Advantages:

- Reduced boundary bias compared with standard KDE, since the local polynomial adapts near the edge of the support.
- The fit is to the log-density, so the density estimate $\exp(\hat{\beta}_0)$ is automatically positive.
- Derivatives of the log-density come directly from the fitted coefficients (e.g., $\hat{\beta}_1$ estimates $(\log f)'(x)$).
5.2 Local Likelihood Density Estimation
Local likelihood approaches maximize a locally weighted likelihood:
$$\hat{f}(x) = \arg\max_{\theta} \sum_{i=1}^{n} w_i(x) \cdot \log p(X_i; \theta)$$
where $w_i(x) = K_h(x - X_i)$ are weights based on distance to $x$, and $p(\cdot; \theta)$ is a parametric family (often log-linear or Gaussian).
5.3 Logspline and Polyshrink Estimators
These are spline-based density estimators that adapt locally in an implicit way: logspline models the log-density as a cubic spline and chooses knot locations adaptively from the data, placing more knots where the density has more structure, while polyshrink applies adaptive shrinkage to the fitted coefficients.
Benefits:

- The estimated density is positive everywhere and, after normalization, integrates to 1 by construction.
- Knot selection provides local adaptation without an explicit bandwidth parameter.
5.4 When to Use Local Polynomial Methods?
| Situation | Recommended Approach |
|---|---|
| Simple exploratory analysis | Standard KDE or k-NN |
| Formal inference, publication | Abramson adaptive or local likelihood |
| Boundary issues important | Local polynomial (reduced boundary bias) |
| Very smooth densities | Standard KDE may suffice |
| Complex multimodal structure | Abramson adaptive or local likelihood |
Variable bandwidth estimators have been extensively analyzed theoretically. Here we summarize the key results relevant to practitioners.
6.1 Asymptotic MISE
The Abramson sample-smoothing estimator achieves AMISE:
$$\text{AMISE} = O(n^{-4/5})$$
the same optimal rate as fixed-bandwidth KDE, but with an improved constant. Specifically, the integrated squared bias is reduced because bias is equalized across the domain.
6.2 Bias Reduction
The key theoretical advantage of Abramson's law is bias standardization: the leading bias term has the same order in $h_0$ everywhere, so no region is systematically oversmoothed or undersmoothed relative to the rest. This can dramatically improve performance for densities with varying curvature.
6.3 Adaptation Costs
Not all is free—adaptive methods incur costs:
Pilot estimate variance: Errors in the pilot propagate to bandwidth selection, adding noise.
Two-stage inference: The final estimator's distribution is affected by pilot randomness.
Parameter explosion: Choosing pilot bandwidth, adaptation sensitivity ($\alpha$), and final bandwidth is more complex than just choosing one $h$.
Adaptation helps most when: The target density has a wide range of local densities and curvatures (e.g., a mixture of sharp and broad components). Adaptation helps least when: The density is roughly uniform in structure (e.g., unimodal, symmetric). In this case, the added complexity of adaptation may not be worth the modest improvement, and standard KDE is often adequate.
6.4 Convergence Rate Comparison
| Method | MISE Rate | Remarks |
|---|---|---|
| Fixed bandwidth (optimal $h$) | $O(n^{-4/5})$ | Best possible for second-order kernels |
| Abramson adaptive | $O(n^{-4/5})$ | Same rate, better constant for heterogeneous densities |
| k-NN balloon | $O(n^{-4/(d+4)})$ | Standard rate, but suboptimal constant |
| Local polynomial (order $p$) | $O(n^{-2(p+1)/(2p+d+2)})$ | Can achieve faster rates with higher order |
Key insight: For second-order kernels, you can't beat the $n^{-4/5}$ rate (in 1D) regardless of variable bandwidth. The advantage is in the constant factor and robustness to density heterogeneity, not in the rate itself.
Implementing variable bandwidth estimators requires attention to several practical details.
7.1 Computational Complexity
| Method | Precomputation | Evaluation at $m$ points |
|---|---|---|
| Fixed KDE (naive) | — | $O(nm)$ |
| Fixed KDE (FFT) | $O(n + G \log G)$ | $O(m)$ with interpolation |
| Abramson adaptive | $O(n^2)$ (pilot at data points) | $O(nm)$ (no FFT possible) |
| k-NN adaptive | $O(n \log n)$ (build tree) | $O(km \log n)$ |
Key issue: Variable bandwidth breaks the convolution structure that enables FFT acceleration. For large datasets, this can be a significant bottleneck.
Mitigation strategies:

- Compute the pilot estimate on a binned grid (histogram plus smoothing) rather than by pairwise sums, reducing the $O(n^2)$ pilot stage to roughly $O(n + G^2)$, or $O(n + G \log G)$ with an FFT convolution on the grid.
- Use compactly supported (or truncated) kernels together with spatial trees, so each evaluation touches only nearby points.
- Subsample the data for the pilot stage; the bandwidth factors are relatively insensitive to pilot noise.
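One common mitigation, computing the pilot on a binned grid, can be sketched as follows. This is an illustrative approximation (the function name and grid size are choices, not a standard API): data are snapped to histogram bins, so the pilot costs roughly $O(n + G^2)$ for $G$ grid points instead of $O(n^2)$:

```python
import numpy as np

def binned_pilot_density(data, h_pilot, n_grid=512):
    """Approximate pilot densities f~(X_i) by smoothing a histogram on a
    grid and interpolating back to the data points.  Cost is O(n + G^2)
    for G grid points, versus O(n^2) for exact pairwise sums."""
    lo = data.min() - 3 * h_pilot
    hi = data.max() + 3 * h_pilot
    grid = np.linspace(lo, hi, n_grid)
    counts, edges = np.histogram(data, bins=n_grid, range=(lo, hi))
    centers = 0.5 * (edges[:-1] + edges[1:])

    # Gaussian-kernel smoothing of the bin counts on the grid
    u = (grid[:, None] - centers[None, :]) / h_pilot
    dens_grid = (np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)) @ counts
    dens_grid /= len(data) * h_pilot

    # Interpolate the gridded pilot back to the data points
    return np.interp(data, grid, dens_grid)

rng = np.random.default_rng(0)
data = rng.normal(0.0, 1.0, 2000)
pilot = binned_pilot_density(data, h_pilot=0.3)
print(pilot.min(), pilot.max())
```

For very large $G$, the $G \times G$ kernel matrix can itself be replaced by an FFT convolution of the bin counts, restoring $O(G \log G)$ smoothing; the approximation idea is the same.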
```python
import numpy as np
from scipy.spatial import KDTree


class KNNAdaptiveKDE:
    """
    k-Nearest Neighbor Adaptive Kernel Density Estimator.

    Uses the distance to the k-th nearest neighbor as the local
    bandwidth at each evaluation point.
    """

    def __init__(self, k=None, kernel='gaussian'):
        """
        Parameters:
        -----------
        k : int or None
            Number of neighbors. If None, uses sqrt(n).
        kernel : str
            Kernel type: 'gaussian' or 'epanechnikov'
        """
        self.k = k
        self.kernel = kernel

    def fit(self, data):
        """Build the KD-tree for efficient neighbor queries."""
        self.data = np.atleast_2d(data)
        if self.data.shape[0] == 1:
            self.data = self.data.T
        self.n, self.d = self.data.shape

        # Default k
        if self.k is None:
            self.k = max(1, int(np.sqrt(self.n)))

        # Build KD-tree
        self.tree = KDTree(self.data)
        return self

    def _kernel(self, u):
        """Evaluate kernel at scaled distances u."""
        if self.kernel == 'gaussian':
            return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)
        elif self.kernel == 'epanechnikov':
            return np.where(np.abs(u) <= 1, 0.75 * (1 - u**2), 0)

    def evaluate(self, x):
        """Evaluate adaptive KDE at points x."""
        x = np.atleast_2d(x)
        if x.shape[0] == 1 and self.d > 1:
            x = x.T
        elif x.shape[1] != self.d:
            x = x.reshape(-1, self.d)

        m = x.shape[0]
        density = np.zeros(m)

        for i in range(m):
            # Find k-th nearest neighbor distance
            distances, _ = self.tree.query(x[i], k=self.k + 1)
            h_local = distances[-1]  # k-th neighbor (excluding self)
            if h_local < 1e-10:
                h_local = 1e-10  # Prevent division by zero

            # Evaluate kernel at all data points with this bandwidth
            if self.d == 1:
                u = (x[i] - self.data.flatten()) / h_local
            else:
                u = np.linalg.norm(x[i] - self.data, axis=1) / h_local

            density[i] = np.mean(self._kernel(u)) / h_local**self.d

        return density


def compare_kde_methods(data, x_eval):
    """Compare fixed, Abramson, and k-NN adaptive KDE."""
    from scipy.stats import gaussian_kde

    n = len(data)
    results = {}

    # 1. Fixed bandwidth (Silverman)
    fixed_kde = gaussian_kde(data, bw_method='silverman')
    results['fixed'] = fixed_kde(x_eval)
    results['fixed_bw'] = fixed_kde.factor * np.std(data)

    # 2. Abramson adaptive
    abramson = AbramsonKDE().fit(data)
    results['abramson'] = abramson.evaluate(x_eval)
    results['abramson_bw_range'] = (
        abramson.h_local.min(), abramson.h_local.max()
    )

    # 3. k-NN adaptive
    knn_kde = KNNAdaptiveKDE(k=int(np.sqrt(n))).fit(data.reshape(-1, 1))
    results['knn'] = knn_kde.evaluate(x_eval.reshape(-1, 1))

    return results


# Example comparison
np.random.seed(42)

# Create challenging density: sharp peak + broad region + tail
data = np.concatenate([
    np.random.normal(-2, 0.2, 300),    # Very sharp
    np.random.normal(1, 1.5, 150),     # Very broad
    np.random.exponential(2, 100) + 3  # Long tail
])

x_eval = np.linspace(-4, 15, 300)
results = compare_kde_methods(data, x_eval)

print("Comparison of KDE Methods:")
print(f"Fixed bandwidth: {results['fixed_bw']:.4f}")
print(f"Abramson bandwidth range: {results['abramson_bw_range'][0]:.4f} - "
      f"{results['abramson_bw_range'][1]:.4f}")
print(f"Abramson adapts by factor of "
      f"{results['abramson_bw_range'][1] / results['abramson_bw_range'][0]:.1f}x")
```

7.2 Numerical Stability
Floor densities: When computing $f^{-1/2}$, very small pilot densities cause bandwidth explosion. Apply a floor: $\tilde{f}(x) \leftarrow \max(\tilde{f}(x), \epsilon)$.
Cap bandwidth ratios: Prevent extreme bandwidth variation by capping: $h_i \leftarrow \text{clip}(h_i, h_0/r, h_0 \cdot r)$ for some ratio $r$ (e.g., $r = 5$).
Sensitivity to outliers: Outliers get large bandwidths (low pilot density). This may or may not be desired. Consider robust pilot methods.
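These safeguards combine in a few lines. A minimal sketch, where the floor $\epsilon$ and ratio cap $r$ are tuning choices rather than standard values:

```python
import numpy as np

def stabilized_bandwidths(pilot_densities, h0, eps=1e-10, r=5.0):
    """Abramson-style local bandwidths with a density floor and a cap
    on how far any h_i may drift from the global bandwidth h0."""
    fd = np.maximum(pilot_densities, eps)       # floor tiny pilot densities
    g = np.exp(np.mean(np.log(fd)))             # geometric mean of pilot densities
    h = h0 * (fd / g) ** (-0.5)                 # square-root law h_i = h0 / sqrt(f_i/g)
    return np.clip(h, h0 / r, h0 * r)           # cap bandwidth ratios at r

# A near-zero pilot density would explode without the floor and cap
pilot = np.array([0.4, 0.1, 1e-14, 0.25])
print(stabilized_bandwidths(pilot, h0=0.3))
```

With these inputs, the near-zero pilot density is floored and its bandwidth capped at $h_0 \cdot r$, while the well-populated points are capped at $h_0 / r$.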
7.3 Software Availability
- **R:** the `ks` package provides excellent adaptive KDE via `kde()` with `adaptive=TRUE`.
- **Python:** `KDEpy` has some adaptive features; otherwise implement manually.
- **MATLAB:** the `ksdensity` function has limited adaptive options; manual implementation is often needed.

Variable bandwidth methods address a fundamental limitation of fixed-bandwidth KDE: the inability to optimally smooth regions with different local characteristics. Here are the key takeaways:

- A fixed bandwidth is a compromise; the optimal bandwidth varies with local density and curvature.
- Sample-smoothing (Abramson-style) estimators integrate to 1 and have the stronger theory; balloon (k-NN) estimators are simpler and popular in machine learning.
- Abramson's square-root law $h_i \propto f(X_i)^{-1/2}$ equalizes the order of the bias across the domain.
- Adaptation improves the constant, not the $O(n^{-4/5})$ rate, and costs a pilot stage, extra tuning parameters, and the loss of FFT acceleration.
- In practice, stabilize implementations by flooring pilot densities and capping bandwidth ratios.
You have completed Module 2: KDE Theory. You now have a deep understanding of kernel functions, the bias-variance tradeoff, bandwidth selection methods, the curse of dimensionality, and variable bandwidth techniques. This theoretical foundation prepares you for the next modules on Gaussian Mixture Models and the Expectation-Maximization algorithm, where many of these concepts will reappear in parametric settings.