Kernel Density Estimation (KDE) is the premier nonparametric method for estimating probability density functions. First proposed by Murray Rosenblatt in 1956 and independently by Emanuel Parzen in 1962, KDE elegantly addresses the limitations of histograms while maintaining the flexibility to capture arbitrary distributional shapes.
The core insight is profound in its simplicity: rather than counting points that fall into predefined bins, we place a smooth "bump" at each data point and sum these bumps to create a continuous density estimate. This approach eliminates the histogram's discontinuities, removes dependence on arbitrary bin origins, and achieves faster convergence rates.
KDE has become ubiquitous in modern data science, underpinning applications from visualization tools (density plots, violin plots) to sophisticated machine learning methods (mean shift clustering, anomaly detection, generative modeling). Understanding KDE deeply is essential for any practitioner working with probability distributions.
By the end of this page, you will understand the mathematical formulation of KDE, analyze properties of different kernel functions, derive the bias-variance tradeoff and optimal convergence rates, appreciate boundary effects and their corrections, and connect KDE to broader concepts in machine learning including smoothing, regularization, and nonparametric Bayesian methods.
Given observations $\mathcal{D} = \{x_1, x_2, \ldots, x_n\}$ drawn i.i.d. from unknown density $f(x)$, the kernel density estimator is defined as:
$$\hat{f}_h(x) = \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right)$$
where $K(\cdot)$ is the kernel function, $h > 0$ is the bandwidth (smoothing parameter), and $n$ is the sample size.
Intuition: At each data point $x_i$, we place a kernel with height $1/(nh)$ and spread $h$. As we move along the $x$-axis, we sum contributions from all kernels. Near clusters of data, many kernels contribute, yielding high density. In sparse regions, few kernels contribute, yielding low density.
Equivalent Form:
Define the scaled kernel $K_h(u) = \frac{1}{h}K\left(\frac{u}{h}\right)$. Then:
$$\hat{f}_h(x) = \frac{1}{n} \sum_{i=1}^{n} K_h(x - x_i)$$
This shows KDE as an average of kernels centered at each data point.
Why This Works:
Consider the expected value of $\hat{f}_h(x)$ at a point $x$: $$\mathbb{E}[\hat{f}_h(x)] = \mathbb{E}\left[K_h(x - X)\right] = \int K_h(x - t) f(t) \, dt = (K_h * f)(x)$$
This is the convolution of the kernel with the true density. For small $h$, $K_h$ approximates a Dirac delta, so $(K_h * f)(x) \approx f(x)$. The kernel acts as a smoothing filter on the density.
Verification of Valid Density:
$$\int \hat{f}_h(x) \, dx = \frac{1}{n} \sum_{i=1}^{n} \int K_h(x - x_i) \, dx = \frac{1}{n} \sum_{i=1}^{n} 1 = 1$$
Since each kernel integrates to 1, so does their average. Also, $\hat{f}_h(x) \geq 0$ everywhere (assuming $K \geq 0$), so KDE produces a valid probability density.
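To make the validity check concrete, here is a minimal numerical sketch (Gaussian kernel, with a simulated sample and a hand-picked bandwidth chosen purely for illustration): the estimate is built as an average of scaled kernels, is non-negative, and its numerical integral comes out at approximately 1.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.0, size=200)   # i.i.d. sample from an "unknown" f
h = 0.4                                           # bandwidth, chosen by hand here

x = np.linspace(-2, 6, 2001)                      # evaluation grid covering the data

# Scaled Gaussian kernel K_h(u) = K(u/h)/h
K_h = lambda u: np.exp(-0.5 * (u / h) ** 2) / (h * np.sqrt(2 * np.pi))

# KDE as an average of scaled kernels centered at each data point
f_hat = np.mean(K_h(x[:, None] - data[None, :]), axis=1)

# The estimate is non-negative and integrates to ~1 (simple Riemann sum)
print(f_hat.min() >= 0)                           # True
print(np.sum(f_hat) * (x[1] - x[0]))              # ~= 1.0 (up to grid truncation)
```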
In histograms, each point contributes only to its bin—a discontinuous, all-or-nothing contribution based on arbitrary boundaries. In KDE, each point contributes smoothly to the entire density estimate, with contribution decaying with distance through the kernel function. This smooth spreading eliminates boundary artifacts and produces continuous, differentiable estimates.
| Component | Symbol | Role | Key Properties |
|---|---|---|---|
| Kernel Function | K(u) | Shape of each bump | Symmetric, ∫K=1, K≥0 |
| Bandwidth | h | Width of each bump | Controls smoothness |
| Scaled Kernel | K_h(u) = K(u/h)/h | Actual contribution | Integrates to 1 |
| Data Point | x_i | Center of bump i | Location where kernel is placed |
| Estimate | f̂_h(x) | Sum of all bumps | Valid PDF, smooth (if K is) |
A kernel function $K(u)$ must satisfy the following conditions:

1. Non-negativity: $K(u) \geq 0$ for all $u$
2. Normalization: $\int K(u) \, du = 1$
3. Symmetry: $K(-u) = K(u)$

Additionally, most kernels satisfy:

4. Zero mean: $\int u \cdot K(u) \, du = 0$ (follows from symmetry)
5. Finite variance: $\int u^2 \cdot K(u) \, du = \mu_2(K) < \infty$
The quantity $\mu_2(K) = \int u^2 K(u) \, du$ is the second moment of the kernel, which appears in bias calculations.
| Kernel | Formula K(u) | Support | Properties |
|---|---|---|---|
| Gaussian | (2π)⁻¹/² exp(-u²/2) | (-∞, ∞) | Smooth, infinitely differentiable |
| Epanechnikov | ¾(1-u²) 𝟙(|u|≤1) | [-1, 1] | MISE-optimal, parabolic |
| Uniform (Box) | ½ 𝟙(|u|≤1) | [-1, 1] | Simplest, rectangular |
| Triangular | (1-|u|) 𝟙(|u|≤1) | [-1, 1] | Continuous, linear decay |
| Quartic (Biweight) | (15/16)(1-u²)² 𝟙(|u|≤1) | [-1, 1] | Smooth at boundaries |
| Triweight | (35/32)(1-u²)³ 𝟙(|u|≤1) | [-1, 1] | Very smooth |
| Cosine | (π/4)cos(πu/2) 𝟙(|u|≤1) | [-1, 1] | Oscillating but smooth |
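As a quick numerical check of these definitions (a sketch using plain Riemann sums rather than a library routine), the normalization, second moment $\mu_2(K)$, and roughness $R(K) = \int K^2(u)\,du$ of the Gaussian and Epanechnikov kernels can be verified directly; these constants reappear in the bias and variance formulas later.

```python
import numpy as np

u = np.linspace(-6, 6, 200001)
du = u[1] - u[0]

kernels = {
    "gaussian":     np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi),
    "epanechnikov": 0.75 * (1 - u**2) * (np.abs(u) <= 1),
}

for name, K in kernels.items():
    total = np.sum(K) * du            # should be 1
    mu2   = np.sum(u**2 * K) * du     # second moment mu_2(K)
    R     = np.sum(K**2) * du         # roughness R(K) = int K^2 du
    print(f"{name:13s}  int K = {total:.4f}   mu2 = {mu2:.4f}   R(K) = {R:.4f}")
# Expected: Gaussian mu2 = 1, R ~ 0.2821; Epanechnikov mu2 = 0.2, R = 0.6
```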
The Gaussian Kernel:
The Gaussian kernel is by far the most widely used:
$$K(u) = \frac{1}{\sqrt{2\pi}} e^{-u^2/2}$$
Why Gaussian is Popular:
Infinite differentiability: $\hat{f}_h$ inherits all derivatives of $K$, making KDE with Gaussian kernel infinitely smooth.
Infinite support: Every data point contributes (albeit negligibly) to the estimate everywhere. No arbitrary cutoffs.
Mathematical convenience: Convolution of Gaussians is Gaussian. Many theoretical results simplify.
Connection to diffusion: KDE with Gaussian kernel can be viewed as heat diffusion from point sources.
Ubiquitous availability: Every statistical software implements Gaussian KDE.
Disadvantage: Unbounded support means computing $\hat{f}_h(x)$ technically requires summing $n$ terms, though in practice far-away points contribute negligibly.
The Epanechnikov Kernel:
$$K(u) = \frac{3}{4}(1 - u^2) \cdot \mathbf{1}(|u| \leq 1)$$
This inverted parabola is theoretically optimal in the sense of minimizing MISE among all kernels when the bandwidth is optimally chosen. The proof, due to Epanechnikov (1969), shows that the optimal kernel shape for minimizing $\int \text{Var}[\hat{f}_h(x)] \, dx$ subject to a fixed $\int \text{Bias}^2[\hat{f}_h(x)] \, dx$ is a parabola.
In Practice:
Despite its optimality, the efficiency gain over Gaussian is small—approximately 5% in MISE. Since kernel choice matters much less than bandwidth selection, the convenience of Gaussian often outweighs the marginal efficiency of Epanechnikov.
Bounded Support Advantage: For compact-support kernels (Epanechnikov, Uniform, etc.), computing $\hat{f}_h(x)$ only requires summing over points within distance $h$ of $x$, enabling $O(n)$ or better algorithms with spatial data structures.
The 'efficiency' of a kernel measures its MISE relative to Epanechnikov. For smooth densities: Epanechnikov = 1.000, Biweight = 0.994, Triweight = 0.987, Gaussian = 0.951, Triangular = 0.986, Uniform = 0.930. These differences are small—kernel choice is far less important than bandwidth choice.
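The practical irrelevance of kernel shape is easy to see numerically. The sketch below (simulated data and a Silverman-style bandwidth, both assumptions for illustration) compares Gaussian and Epanechnikov estimates; the Epanechnikov bandwidth is rescaled by $\sqrt{5}$ so the two kernels have the same standard deviation, one simple way to put them on a comparable scale.

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(size=500)
x = np.linspace(-4, 4, 801)
h = 0.9 * data.std(ddof=1) * len(data) ** -0.2    # simplified Silverman-style bandwidth

def kde(x, data, h, kernel):
    u = (x[:, None] - data[None, :]) / h
    return kernel(u).mean(axis=1) / h

gauss = lambda u: np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)
epan  = lambda u: 0.75 * (1 - u**2) * (np.abs(u) <= 1)

f_gauss = kde(x, data, h, gauss)
# Epanechnikov kernel has variance 1/5, so scale h by sqrt(5) to match spreads
f_epan  = kde(x, data, h * np.sqrt(5), epan)

print(np.max(np.abs(f_gauss - f_epan)))   # small: the two estimates nearly coincide
```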
The statistical properties of KDE reveal precisely how bandwidth controls the bias-variance tradeoff.
Expected Value (Bias Derivation):
Recall $\mathbb{E}[\hat{f}_h(x)] = \int K_h(x-t) f(t) \, dt$. Using a Taylor expansion of $f(t)$ around $x$:
$$f(t) = f(x) + (t-x)f'(x) + \frac{(t-x)^2}{2}f''(x) + O((t-x)^3)$$
Substituting and using symmetry of $K$ (which zeros the $f'$ term):
$$\mathbb{E}[\hat{f}_h(x)] = f(x) \int K(u) \, du + \frac{h^2 f''(x)}{2} \int u^2 K(u) \, du + O(h^4)$$
$$= f(x) + \frac{h^2 f''(x)}{2} \mu_2(K) + O(h^4)$$
where $\mu_2(K) = \int u^2 K(u) \, du$ is the kernel's second moment.
Bias: $$\text{Bias}[\hat{f}_h(x)] = \mathbb{E}[\hat{f}_h(x)] - f(x) \approx \frac{h^2}{2} f''(x) \mu_2(K)$$
Bias is proportional to $h^2$ (so it vanishes as the bandwidth shrinks), to the curvature $f''(x)$ of the true density (largest at peaks and valleys), and to the kernel's second moment $\mu_2(K)$.
Variance Derivation:
Since $\hat{f}_h(x) = \frac{1}{n}\sum_{i=1}^n K_h(x-x_i)$ is an average of i.i.d. random variables:
$$\text{Var}[\hat{f}_h(x)] = \frac{1}{n} \text{Var}[K_h(x-X)] = \frac{1}{n}\left(\mathbb{E}[K_h^2(x-X)] - \mathbb{E}[K_h(x-X)]^2\right)$$
For small $h$, the second term is approximately $f(x)^2$, and:
$$\mathbb{E}[K_h^2(x-X)] = \int K_h^2(x-t) f(t) \, dt = \frac{1}{h^2} \int K^2\left(\frac{x-t}{h}\right) f(t) \, dt$$
$$= \frac{f(x)}{h} \int K^2(u) \, du + O(1) \approx \frac{f(x) R(K)}{h}$$
where $R(K) = \int K^2(u) \, du$ is the roughness of the kernel.
Variance: $$\text{Var}[\hat{f}_h(x)] \approx \frac{f(x) R(K)}{nh}$$
Variance is proportional to the local density $f(x)$ and to the kernel roughness $R(K)$, and inversely proportional to the effective local sample size $nh$: shrinking the bandwidth makes the estimate noisier.
Mean Squared Error (MSE):
$$\text{MSE}[\hat{f}_h(x)] = \text{Bias}^2 + \text{Var} \approx \frac{h^4}{4} [f''(x)]^2 [\mu_2(K)]^2 + \frac{f(x) R(K)}{nh}$$
Mean Integrated Squared Error (MISE):
$$\text{MISE}(\hat{f}_h) = \int \text{MSE}[\hat{f}_h(x)] \, dx$$
$$\approx \frac{h^4}{4} [\mu_2(K)]^2 R(f'') + \frac{R(K)}{nh}$$
where $R(f'') = \int [f''(x)]^2 \, dx$ is the roughness of the true density's second derivative.
Finding the optimal bandwidth is like Goldilocks finding the right porridge—too small leads to spiky chaos, too large leads to bland uniformity. The optimal bandwidth strikes a perfect balance where the sum of squared bias and variance is minimized.
Deriving the Optimal Bandwidth:
To find the MISE-optimal $h$, differentiate MISE with respect to $h$ and set to zero:
$$\frac{d}{dh} \text{MISE} = h^3 [\mu_2(K)]^2 R(f'') - \frac{R(K)}{nh^2} = 0$$
Solving:
$$h^* = \left(\frac{R(K)}{n [\mu_2(K)]^2 R(f'')}\right)^{1/5}$$
Order Notation: $$h^* = O(n^{-1/5})$$
The optimal bandwidth shrinks as the fifth root of sample size.
Optimal MISE:
$$\text{MISE}^* = \frac{5}{4} \left(\frac{[\mu_2(K)]^2 R(K)^4}{n^4} R(f'')\right)^{1/5} = O(n^{-4/5})$$
Thus KDE converges at rate $n^{-4/5}$, faster than histograms ($n^{-2/3}$) but slower than parametric ($n^{-1}$).
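A small simulation sketch (assumed setup: standard normal data, Gaussian kernel, modest sample size) illustrates this tradeoff: averaging the integrated squared error over replications traces out the U-shaped MISE curve in $h$, and its minimizer lands near the asymptotic value $h^* = 1.06\,n^{-1/5}$ derived for Gaussian data in the Silverman's-rule discussion below.

```python
import numpy as np

rng = np.random.default_rng(2)
n, reps = 200, 20
x = np.linspace(-4, 4, 401)
dx = x[1] - x[0]
f_true = np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)   # true density: standard normal

def kde(x, data, h):
    u = (x[:, None] - data[None, :]) / h
    return np.exp(-0.5 * u**2).mean(axis=1) / (h * np.sqrt(2 * np.pi))

h_grid = np.linspace(0.1, 1.0, 19)
mise = np.zeros_like(h_grid)
for _ in range(reps):                                # average ISE over replications
    data = rng.normal(size=n)
    for j, h in enumerate(h_grid):
        mise[j] += np.sum((kde(x, data, h) - f_true) ** 2) * dx / reps

h_best = h_grid[np.argmin(mise)]                     # empirical minimizer of MISE
h_theory = 1.06 * n ** -0.2                          # asymptotic h* for N(0,1) data
print(f"empirical minimizer h = {h_best:.2f}   asymptotic h* = {h_theory:.2f}")
```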
Interpretation of Optimal Bandwidth:
The formula for $h^*$ reveals key insights:
Sample Size Effect: $h^* \propto n^{-1/5}$ — more data allows finer resolution.
Density Roughness Effect: $h^* \propto R(f'')^{-1/5}$ — rough densities need smaller bandwidths.
Kernel Effect: Through $R(K)$ and $\mu_2(K)$.
The Chicken-and-Egg Problem:
$R(f'') = \int [f''(x)]^2 dx$ involves the unknown density $f$. We cannot compute the optimal bandwidth exactly without already knowing $f$. This motivates:
Silverman's Rule of Thumb (1986):
Assuming $f$ is Gaussian $N(\mu, \sigma^2)$, we have $R(f'') = \frac{3}{8\sqrt{\pi}\sigma^5}$.
For the Gaussian kernel with $R(K) = (2\sqrt{\pi})^{-1}$ and $\mu_2(K) = 1$:
$$h^* = \left(\frac{4\hat{\sigma}^5}{3n}\right)^{1/5} \approx 1.06 \cdot \hat{\sigma} \cdot n^{-1/5}$$
To handle non-normal data, Silverman suggested using a robust scale estimate:
$$h = 0.9 \cdot \min\left(\hat{\sigma}, \frac{\text{IQR}}{1.34}\right) \cdot n^{-1/5}$$
The factor 0.9 adjusts for the different roughness assumptions. IQR/1.34 equals $\sigma$ for Gaussian but is more robust.
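A short sketch of both rules, compared against SciPy's `gaussian_kde` with `bw_method='silverman'`; the comment about SciPy's scale-free `factor` reflects my understanding of its convention and is worth checking against the SciPy documentation.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(3)
data = rng.normal(loc=5.0, scale=2.0, size=1000)
n = len(data)
sigma = data.std(ddof=1)
iqr = np.percentile(data, 75) - np.percentile(data, 25)

h_normal_ref = 1.06 * sigma * n ** -0.2                    # Gaussian-reference rule
h_silverman  = 0.9 * min(sigma, iqr / 1.34) * n ** -0.2    # robust Silverman rule

# SciPy's 'silverman' bw_method yields a scale-free factor; as I understand it, the
# effective kernel standard deviation is factor * sample standard deviation.
kde = gaussian_kde(data, bw_method="silverman")
h_scipy = kde.factor * sigma

print(f"1.06 rule: {h_normal_ref:.3f}   0.9 robust rule: {h_silverman:.3f}   scipy: {h_scipy:.3f}")
```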
Properties of Silverman's Rule: it is closed-form and cheap to compute, depends on the data only through $\hat{\sigma}$, the IQR, and $n$, and works well when the true density is close to Gaussian.
Silverman's rule assumes a unimodal Gaussian reference. For multimodal data (mixture distributions), it will oversmooth and merge separate modes into one. For skewed or heavy-tailed data, it may also perform poorly. In such cases, cross-validation or plug-in methods are preferred.
When the density has bounded support (e.g., positive data, proportions in [0,1]), standard KDE suffers from boundary bias—systematic errors near the edges of the support.
The Problem:
Consider data on $[0, \infty)$. At $x = 0$, the kernel extends into negative values where no data can exist. The integral:
$$\int_{-\infty}^{\infty} K_h(0 - t) f(t) \, dt$$
is effectively restricted to $[0, \infty)$: the kernel places part of its mass on $(-\infty, 0)$, where $f(t) = 0$, so that mass is lost and the estimate is biased downward near 0.
Quantifying Boundary Bias:
At a boundary point $x = 0$, the relative bias is approximately $O(1)$ rather than $O(h^2)$ in the interior. This is a dramatic increase—the estimate can be biased by 50% or more!
Correction Methods:
1. Reflection Method:
Reflect data across the boundary and include both original and reflected points: $$\hat{f}_h(x) = \frac{1}{nh}\sum_{i=1}^n \left[K\left(\frac{x-x_i}{h}\right) + K\left(\frac{x+x_i}{h}\right)\right]$$
This implicitly assumes $f'(0) = 0$ (slope zero at boundary).
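A minimal sketch of the reflection method for nonnegative data (Gaussian kernel; the exponential sample and bandwidth are chosen only for illustration). Because $f'(0) \neq 0$ for the exponential, some bias remains, but the boundary estimate improves dramatically over the uncorrected version.

```python
import numpy as np

def kde_reflect(x, data, h):
    """Gaussian KDE on [0, inf) with reflection of the data about the boundary at 0."""
    x = np.atleast_1d(x)[:, None]
    pts = data[None, :]
    norm = 1.0 / (len(data) * h * np.sqrt(2 * np.pi))
    # Original kernels plus their mirror images reflected across x = 0
    return norm * (np.exp(-0.5 * ((x - pts) / h) ** 2)
                   + np.exp(-0.5 * ((x + pts) / h) ** 2)).sum(axis=1)

rng = np.random.default_rng(4)
data = rng.exponential(scale=1.0, size=2000)   # true density f(x) = exp(-x), x >= 0
h = 0.2

naive_at_0 = np.exp(-0.5 * (data / h) ** 2).sum() / (len(data) * h * np.sqrt(2 * np.pi))
print(f"naive KDE at 0:     {naive_at_0:.3f}")                 # roughly half the true value
print(f"reflected KDE at 0: {kde_reflect(0.0, data, h)[0]:.3f} (true value 1.0)")
```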
2. Boundary Kernels:
Use modified kernels near boundaries that integrate to 1 over the restricted domain: $$K_b(u; x, h) = \frac{K(u)}{\int_{L(x)}^{R(x)} K(v) dv}$$
where $L(x)$ and $R(x)$ are the effective kernel limits given the boundary.
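For a Gaussian kernel on $[0, \infty)$, the normalizer $\int_{L(x)}^{R(x)} K(v)\,dv$ reduces to $\Phi(x/h)$, giving the simple cut-and-normalize sketch below (data and bandwidth are illustrative assumptions):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(8)
data = rng.exponential(scale=1.0, size=2000)   # support [0, inf), true f(x) = exp(-x)
h = 0.2
x = np.array([0.0, 0.1, 0.5, 2.0])

# Naive Gaussian KDE
u = (x[:, None] - data[None, :]) / h
f_naive = np.exp(-0.5 * u**2).sum(axis=1) / (len(data) * h * np.sqrt(2 * np.pi))

# Renormalize by the kernel mass that actually falls inside [0, inf);
# for a Gaussian kernel centered at x this mass is Phi(x / h).
mass_inside = norm.cdf(x / h)
f_corrected = f_naive / mass_inside

print(np.round(f_naive, 3))
print(np.round(f_corrected, 3))   # noticeably closer to exp(-x) near the boundary
```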
3. Pseudodata Methods:
Generate pseudodata beyond the boundary using local linear extrapolation.
4. Local Polynomial Methods:
Fit a local polynomial (rather than constant) at boundary regions, which automatically corrects for boundary effects.
5. Transformation:
Map data to an unbounded domain (e.g., log transform for positive data), apply KDE, and back-transform.
The transformation approach is often simplest: for positive data, apply log transform, estimate density on real line, then transform back using the Jacobian. For [0,1] data, use logit transform. This leverages standard KDE machinery while avoiding boundary issues entirely.
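A sketch of the transformation approach for positive data (a lognormal sample is assumed so the answer is known): estimate the density of $Y = \log X$ with an ordinary Gaussian KDE, then back-transform with the Jacobian $1/x$.

```python
import numpy as np

rng = np.random.default_rng(5)
data = rng.lognormal(mean=0.0, sigma=0.5, size=2000)     # positive data

# 1. Transform to the real line
y = np.log(data)
h = 0.9 * min(y.std(ddof=1),
              (np.percentile(y, 75) - np.percentile(y, 25)) / 1.34) * len(y) ** -0.2

def kde_gauss(t, sample, h):
    u = (np.atleast_1d(t)[:, None] - sample[None, :]) / h
    return np.exp(-0.5 * u**2).mean(axis=1) / (h * np.sqrt(2 * np.pi))

# 2. Estimate the density of Y = log X, then 3. back-transform with the Jacobian 1/x
x = np.linspace(0.05, 5.0, 500)
f_x = kde_gauss(np.log(x), y, h) / x

# True lognormal(0, 0.5) density for comparison
true = np.exp(-0.5 * (np.log(x) / 0.5) ** 2) / (x * 0.5 * np.sqrt(2 * np.pi))
print(f"max abs error on grid: {np.max(np.abs(f_x - true)):.3f}")   # small
```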
Beta Kernel for [0,1] Data:
For data on [0, 1], a specialized solution uses Beta distribution kernels:
$$K_{\text{Beta}}(t; x, h) = \frac{t^{x/h}(1-t)^{(1-x)/h}}{B(x/h + 1, (1-x)/h + 1)}$$
The kernel is centered at $x$ but naturally confined to [0, 1]. As $h \to 0$, it concentrates around $x$. This eliminates boundary bias entirely by construction.
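A sketch of the Beta-kernel estimator, interpreting the formula above as $\hat{f}(x) = \frac{1}{n}\sum_{i} K_{\text{Beta}}(x_i; x, h)$ (the sample and bandwidth are illustrative assumptions):

```python
import numpy as np
from scipy.stats import beta

def kde_beta(x_eval, data, h):
    """Beta-kernel density estimate for data on [0, 1]."""
    x_eval = np.atleast_1d(x_eval).astype(float)
    est = np.empty_like(x_eval)
    for j, x in enumerate(x_eval):
        # Beta kernel in t with parameters x/h + 1 and (1-x)/h + 1, evaluated at the data
        est[j] = beta.pdf(data, x / h + 1, (1 - x) / h + 1).mean()
    return est

rng = np.random.default_rng(6)
data = rng.beta(2, 5, size=1000)          # data confined to [0, 1]
x = np.array([0.0, 0.05, 0.5, 0.95, 1.0])
print(kde_beta(x, data, h=0.05))          # no probability mass leaks outside [0, 1]
```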
Gamma Kernel for Positive Data:
Similarly, Gamma kernels can be used for data on $[0, \infty)$: $$K_{\text{Gamma}}(t; x, h) = \text{Gamma distribution centered near } x$$
These specialized kernels achieve optimal rates at boundaries but require different bandwidth selection procedures.
Extending KDE to multiple dimensions is conceptually straightforward but introduces significant computational and statistical challenges.
Product Kernel:
The simplest multivariate extension uses a product of univariate kernels: $$K_H(\mathbf{u}) = \prod_{j=1}^d K(u_j / h_j) / \prod_{j=1}^d h_j$$
with different bandwidths $h_j$ for each dimension. The estimate is: $$\hat{f}_H(\mathbf{x}) = \frac{1}{n} \sum_{i=1}^n K_H(\mathbf{x} - \mathbf{x}_i)$$
Full Bandwidth Matrix:
More generally, use a bandwidth matrix $H$ (positive definite $d \times d$ matrix): $$K_H(\mathbf{u}) = |H|^{-1/2} K(H^{-1/2}\mathbf{u})$$
where $K(\cdot)$ is a multivariate kernel (often multivariate Gaussian).
For multivariate Gaussian kernel: $$K_H(\mathbf{u}) = (2\pi)^{-d/2} |H|^{-1/2} \exp\left(-\frac{1}{2}\mathbf{u}^T H^{-1} \mathbf{u}\right)$$
The bandwidth matrix $H$ controls smoothing in all directions and correlations.
The Curse of Dimensionality in KDE:
Recall the optimal MISE convergence rate: $$\text{MISE}^* = O(n^{-4/(d+4)})$$
| Dimension | Rate | Samples for 10% MISE |
|---|---|---|
| 1 | n⁻⁰·⁸⁰ | ~100 |
| 2 | n⁻⁰·⁶⁷ | ~1,000 |
| 3 | n⁻⁰·⁵⁷ | ~10,000 |
| 5 | n⁻⁰·⁴⁴ | ~300,000 |
| 10 | n⁻⁰·²⁹ | ~10⁸ |
| 20 | n⁻⁰·¹⁷ | ~10¹⁴ |
Beyond $d \approx 5$, KDE becomes impractical without enormous datasets. This is the "curse of dimensionality" in action—high-dimensional spaces are effectively empty, and data become too sparse for nonparametric estimation.
For d > 5-6, KDE is rarely appropriate. Alternatives include: (1) Dimensionality reduction then KDE, (2) Gaussian Mixture Models (parametric), (3) Normalizing flows (neural network-based), (4) Nearest-neighbor density estimates. The curse of dimensionality is a fundamental mathematical limitation, not a software bug.
Bandwidth Matrix Selection:
In $d$ dimensions, the full bandwidth matrix $H$ has $d(d+1)/2$ free parameters. This is usually too many to estimate reliably. Common simplifications:
Scalar bandwidth: $H = h^2 I$ — same bandwidth in all directions. Simplest but doesn't adapt to different scales or correlations.
Diagonal bandwidth: $H = \text{diag}(h_1^2, \ldots, h_d^2)$ — different bandwidths per dimension but no correlation. Often standardize data first.
Full matrix: Allow arbitrary positive definite $H$. Most flexible but requires more data to estimate. Often set $H \propto \widehat{\Sigma}$, the sample covariance matrix.
Scott's Rule for multivariate: $$H = n^{-2/(d+4)} \widehat{\Sigma}$$
Silverman's Rule for multivariate: $$H = \left(\frac{4}{(d+2)n}\right)^{2/(d+4)} \widehat{\Sigma}$$
Both scale the sample covariance by a factor depending on $n$ and $d$.
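A sketch of multivariate KDE with a full bandwidth matrix, using Scott's rule $H = n^{-2/(d+4)} \widehat{\Sigma}$ and the multivariate Gaussian kernel formula above (correlated 2-D test data are assumed so the true density at the origin is known):

```python
import numpy as np

rng = np.random.default_rng(7)
n, d = 2000, 2
cov = np.array([[1.0, 0.6],
                [0.6, 1.0]])                              # correlated 2-D data
data = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=n)

# Scott's multivariate rule: H = n^{-2/(d+4)} * sample covariance
Sigma_hat = np.cov(data, rowvar=False)
H = n ** (-2.0 / (d + 4)) * Sigma_hat
H_inv = np.linalg.inv(H)
norm = 1.0 / (n * (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(H)))

def kde_mv(x):
    """Multivariate Gaussian KDE with full bandwidth matrix H, at a single point x."""
    diff = x - data                                       # (n, d) differences
    quad = np.einsum("ij,jk,ik->i", diff, H_inv, diff)    # quadratic forms u^T H^{-1} u
    return norm * np.exp(-0.5 * quad).sum()

print(f"estimate at origin: {kde_mv(np.zeros(d)):.4f}")
print(f"true density:       {1.0 / (2 * np.pi * np.sqrt(np.linalg.det(cov))):.4f}")
```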
Naive KDE evaluation requires $O(n)$ operations per query point—for $m$ query points, total cost is $O(nm)$. With modern datasets having millions of points, this becomes prohibitive.
Computational Complexity:
For $n$ data points and $m$ evaluation points, direct evaluation costs $O(nm)$ kernel evaluations. Compact-support kernels combined with spatial indexing (k-d trees, ball trees) reduce this substantially, and for evaluation on a regular grid, binning plus the FFT reduces it to roughly $O(n + m \log m)$.
Fast Algorithms:
FFT-Based KDE:
The key insight is that KDE at grid points is a discrete convolution: $$\hat{f}(x_k) = \frac{1}{n} \sum_{i=1}^n K_h(x_k - x_i)$$
If we bin the data onto the same grid, the procedure is: (1) assign each observation to its nearest grid point to form counts $c_k$, (2) evaluate the kernel on the grid, and (3) convolve counts with the kernel using the FFT.
Total complexity: $O(m \log m)$ for the FFT operations, plus $O(n)$ for binning.
Error Analysis: Binning introduces bias proportional to $(\Delta g)^2$ where $\Delta g$ is grid spacing. With grid spacing much smaller than bandwidth, error is negligible.
```python
import numpy as np


def kde_naive(data, x_grid, bandwidth=None):
    """
    Naive O(n*m) KDE implementation.

    Parameters
    ----------
    data : array-like
        Sample data points.
    x_grid : array-like
        Points at which to evaluate the density.
    bandwidth : float, optional
        Bandwidth (uses Silverman's rule if None).

    Returns
    -------
    density : array
        Estimated density at each grid point.
    """
    data = np.asarray(data)
    x_grid = np.asarray(x_grid)
    n = len(data)

    # Silverman's rule of thumb
    if bandwidth is None:
        iqr = np.percentile(data, 75) - np.percentile(data, 25)
        bandwidth = 0.9 * min(np.std(data, ddof=1), iqr / 1.34) * n ** (-0.2)

    # Gaussian kernel
    density = np.zeros(len(x_grid))
    for i, x in enumerate(x_grid):
        # Sum kernel contributions from all data points
        u = (x - data) / bandwidth
        density[i] = np.sum(np.exp(-0.5 * u**2)) / (n * bandwidth * np.sqrt(2 * np.pi))

    return density


def kde_fft(data, n_grid=512, bandwidth=None):
    """
    FFT-accelerated KDE for regular grid evaluation.
    Much faster for large n when evaluating on a grid.
    """
    data = np.asarray(data)
    n = len(data)

    # Silverman's rule
    if bandwidth is None:
        iqr = np.percentile(data, 75) - np.percentile(data, 25)
        bandwidth = 0.9 * min(np.std(data, ddof=1), iqr / 1.34) * n ** (-0.2)

    # Create grid with padding for the convolution
    data_min, data_max = data.min(), data.max()
    padding = 3 * bandwidth
    grid_min, grid_max = data_min - padding, data_max + padding
    grid = np.linspace(grid_min, grid_max, n_grid)
    delta = grid[1] - grid[0]

    # Bin data onto the grid
    counts, _ = np.histogram(data, bins=n_grid, range=(grid_min, grid_max))

    # Create kernel array centered on the grid
    kernel_grid = np.arange(-(n_grid // 2), n_grid // 2 + 1) * delta
    kernel = np.exp(-0.5 * (kernel_grid / bandwidth) ** 2) / (bandwidth * np.sqrt(2 * np.pi))
    kernel = kernel[:n_grid]  # Truncate to match grid size

    # FFT convolution
    counts_fft = np.fft.fft(counts)
    kernel_fft = np.fft.fft(kernel)
    density = np.real(np.fft.ifft(counts_fft * kernel_fft)) / n

    # Shift to align the circular convolution with the grid
    density = np.fft.fftshift(density)

    return grid, np.maximum(density, 0)  # Ensure non-negativity


# Compare the two methods on bimodal data
np.random.seed(42)
data = np.concatenate([np.random.normal(0, 1, 5000),
                       np.random.normal(4, 0.5, 3000)])

x_grid = np.linspace(-4, 7, 500)
density_naive = kde_naive(data, x_grid)
grid_fft, density_fft = kde_fft(data, n_grid=500)

print(f"Data points: {len(data)}")
print(f"Naive max density: {density_naive.max():.4f}")
print(f"FFT max density: {density_fft.max():.4f}")
```

Kernel density estimation represents the gold standard for nonparametric density estimation. We've covered its theoretical foundations and practical aspects in depth: the definition and validity of the estimator, the common kernel functions and their relative efficiencies, the bias-variance tradeoff and the $O(n^{-4/5})$ convergence rate, Silverman's rule of thumb, boundary bias and its corrections, multivariate extensions and the curse of dimensionality, and fast FFT-based computation.
What's Next:
Bandwidth selection is arguably the most critical practical concern in KDE—more important than kernel choice. The next page focuses entirely on bandwidth selection methods, including Silverman's rule, plug-in estimators, and cross-validation approaches. We'll develop both theoretical understanding and practical guidance for choosing bandwidths that balance smoothness against fidelity to data.
You now have comprehensive understanding of kernel density estimation—the workhorse of nonparametric density estimation. From mathematical foundations to computational methods, you're equipped to both apply KDE effectively and understand its theoretical guarantees.