Real-world data rarely exists in one dimension. Medical records contain dozens of biomarkers. Financial portfolios span hundreds of assets. Customer profiles involve scores of features. Understanding the joint distribution of these variables—not just their marginals—is often the key to insight.
Yet extending density estimation to multiple dimensions reveals one of the most profound challenges in statistics: the curse of dimensionality. This phrase, coined by Richard Bellman in 1961, captures the exponential explosion of difficulty as dimension increases. Volume grows faster than data can fill it. Points that seem close in low dimensions become isolated in high dimensions. Intuitions from 2D and 3D break down completely.
This page confronts multivariate density estimation head-on. We'll develop the mathematical framework for multivariate KDE, understand why the curse emerges, and explore practical strategies for when direct estimation becomes infeasible. By the end, you'll understand both the power and the fundamental limitations of nonparametric density estimation in multiple dimensions.
By the end of this page, you will understand the mathematics of multivariate KDE, master bandwidth matrix selection, comprehend the curse of dimensionality at a deep level, and know practical strategies (dimension reduction, sparse methods, parametric models) for high-dimensional density problems.
Let $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n \in \mathbb{R}^d$ be i.i.d. observations from unknown density $f(\mathbf{x})$. The multivariate kernel density estimator is:
$$\hat{f}_H(\mathbf{x}) = \frac{1}{n|H|^{1/2}} \sum_{i=1}^{n} K\left(H^{-1/2}(\mathbf{x} - \mathbf{x}_i)\right)$$
where $H$ is a symmetric positive definite $d \times d$ bandwidth matrix, $|H|$ is its determinant, and $K: \mathbb{R}^d \to \mathbb{R}$ is a multivariate kernel.
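As a concrete sketch, the estimator above can be implemented directly with a multivariate Gaussian kernel. The function name `mv_kde` and the toy data are illustrative, not a library API:

```python
import numpy as np

def mv_kde(x, data, H):
    """Evaluate the multivariate KDE at a single point x.

    x    : (d,) query point
    data : (n, d) observations
    H    : (d, d) positive definite bandwidth matrix
    Uses the standard multivariate Gaussian kernel.
    """
    n, d = data.shape
    H_inv = np.linalg.inv(H)
    det_H = np.linalg.det(H)
    diffs = x - data                                   # (n, d) rows are x - x_i
    # For the Gaussian kernel, K(H^{-1/2} u) depends only on u^T H^{-1} u
    quad = np.einsum('ij,jk,ik->i', diffs, H_inv, diffs)
    K = (2 * np.pi) ** (-d / 2) * np.exp(-0.5 * quad)
    return K.sum() / (n * np.sqrt(det_H))

# Sanity check on 2D standard normal data: the estimate at the origin
# should be close to the true density f(0) = 1/(2*pi) ~ 0.16
rng = np.random.default_rng(0)
data = rng.standard_normal((500, 2))
print(mv_kde(np.zeros(2), data, 0.2 * np.eye(2)))
```

With a single data point at the origin and $H = I$, the estimate at the origin reduces to the Gaussian normalizing constant $(2\pi)^{-d/2}$, which makes a convenient unit check.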
Properties of Multivariate Kernels: a multivariate kernel is a function $K: \mathbb{R}^d \to \mathbb{R}$ satisfying:
- $K(\mathbf{u}) \geq 0$ and $\int K(\mathbf{u}) \, d\mathbf{u} = 1$ (a valid probability density)
- $\int \mathbf{u} \, K(\mathbf{u}) \, d\mathbf{u} = \mathbf{0}$ (symmetric about the origin)
- $\int \mathbf{u}\mathbf{u}^T K(\mathbf{u}) \, d\mathbf{u} = \mu_2(K) \, I_d$ (finite, isotropic second moment)
Common Multivariate Kernels:
1. Multivariate Gaussian (Standard): $$K(\mathbf{u}) = (2\pi)^{-d/2} \exp\left(-\frac{1}{2} \mathbf{u}^T \mathbf{u}\right)$$
With bandwidth matrix $H$, the scaled kernel becomes: $$K_H(\mathbf{u}) = (2\pi)^{-d/2} |H|^{-1/2} \exp\left(-\frac{1}{2} \mathbf{u}^T H^{-1} \mathbf{u}\right)$$
This is a multivariate Gaussian with covariance matrix $H$.
2. Product Kernel: $$K(\mathbf{u}) = \prod_{j=1}^{d} K_1(u_j)$$
where $K_1$ is a univariate kernel. Common choice: product of univariate Gaussians.
3. Spherically Symmetric Epanechnikov: $$K(\mathbf{u}) = \frac{d+2}{2 V_d} (1 - |\mathbf{u}|^2) \cdot \mathbf{1}(|\mathbf{u}| \leq 1)$$
where $V_d = \pi^{d/2} / \Gamma(d/2 + 1)$ is the volume of the unit $d$-ball.
This is optimal (minimizes AMISE) among spherically symmetric kernels.
Product kernels treat dimensions independently—the kernel shape is a hypercube aligned with axes. Spherically symmetric kernels have circular level curves in 2D, spherical in 3D. For general bandwidth matrices H, both are 'stretched' into ellipsoids. The choice matters little compared to H selection, but spherically symmetric kernels are slightly more efficient.
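The unit-ball volume $V_d$ that normalizes the Epanechnikov kernel is a one-liner with the gamma function; a quick sketch (the helper name is illustrative):

```python
import numpy as np
from scipy.special import gamma

def unit_ball_volume(d):
    """Volume of the unit d-ball: V_d = pi^{d/2} / Gamma(d/2 + 1)."""
    return np.pi ** (d / 2) / gamma(d / 2 + 1)

# Familiar low-dimensional values: V_1 = 2, V_2 = pi, V_3 = 4*pi/3.
# Note V_d shrinks toward 0 for large d -- a first hint of the curse.
for d in (1, 2, 3, 10, 20):
    print(d, unit_ball_volume(d))
```

The fact that $V_d \to 0$ as $d$ grows foreshadows the volume phenomena discussed later on this page.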
In $d$ dimensions, the bandwidth is a $d \times d$ positive definite matrix $H$. This matrix controls:
Bandwidth Matrix Parameterization:
A general $d \times d$ positive definite matrix has $d(d+1)/2$ free parameters. For practical estimation, we usually restrict:
| Parameterization | Form of H | Free Parameters | When Appropriate |
|---|---|---|---|
| Scalar | H = h²I | 1 | Data standardized, isotropic smoothing |
| Diagonal | H = diag(h₁², ..., h_d²) | d | Different scales, no rotation |
| Full | H arbitrary positive definite | d(d+1)/2 | Correlated data, tilted densities |
Scalar Bandwidth (Isotropic): $$H = h^2 I_d$$
Same smoothing in all directions. Simple but requires data to be standardized first (each dimension scaled to comparable variance).
Diagonal Bandwidth: $$H = \text{diag}(h_1^2, h_2^2, \ldots, h_d^2)$$
Different smoothing amounts per dimension but no rotation. Appropriate when dimensions have varying scales but are roughly uncorrelated.
Full Bandwidth Matrix: $$H = L L^T \quad \text{(Cholesky decomposition)}$$
Most flexible—can smooth along rotated axes matching data correlations. Required for strongly correlated data where principal directions don't align with coordinate axes.
Practical Choice:
For $d = 2$ or $3$, full $H$ is tractable. For $d > 3$, the $(d(d+1)/2)$ parameters become difficult to estimate. Most practitioners use diagonal $H$ or scale by sample covariance.
Reference Rules for Multivariate Bandwidth:
Scott's Rule: $$H = n^{-2/(d+4)} \hat{\Sigma}$$
where $\hat{\Sigma}$ is the sample covariance matrix.
Silverman's Rule: $$H = \left(\frac{4}{(d+2)n}\right)^{2/(d+4)} \hat{\Sigma}$$
Both rules pre-whiten the data (rotate to principal axes, scale to unit variance) implicitly through the covariance matrix.
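Both reference rules are a scalar factor times the sample covariance, so they are a few lines of NumPy. The function names are illustrative; this is a sketch, not a library API:

```python
import numpy as np

def scott_H(data):
    """Scott's rule: H = n^{-2/(d+4)} * sample covariance."""
    n, d = data.shape
    return n ** (-2 / (d + 4)) * np.cov(data, rowvar=False)

def silverman_H(data):
    """Silverman's rule: H = (4 / ((d+2) n))^{2/(d+4)} * sample covariance."""
    n, d = data.shape
    return (4 / ((d + 2) * n)) ** (2 / (d + 4)) * np.cov(data, rowvar=False)

rng = np.random.default_rng(1)
data = rng.multivariate_normal([0, 0], [[1, 0.6], [0.6, 2]], size=400)
print(scott_H(data))
print(silverman_H(data))
```

A small curiosity: for $d = 2$ the factor $4/(d+2)$ equals 1, so the two rules coincide exactly.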
Derivation of Multivariate Reference Rules:
For a multivariate Gaussian $N(\boldsymbol{\mu}, \Sigma)$ with $d$ dimensions:
$$R(\nabla^2 f) = \int \|\nabla^2 f(\mathbf{x})\|_F^2 \, d\mathbf{x} = \frac{d(d+2)}{4} (4\pi)^{-d/2} |\Sigma|^{-1/2}$$
Substituting into the AMISE-optimal $H$ formula gives the reference rules above.
Using the sample covariance matrix assumes the data's principal directions are the right axes for smoothing. For multimodal data (mixture of clusters at different orientations), this can be inappropriate. Each mode might need its own bandwidth matrix—leading to adaptive multivariate methods or mixture models.
The bias-variance analysis extends to $d$ dimensions, revealing the devastating impact of the curse of dimensionality.
Asymptotic Mean Integrated Squared Error:
For multivariate KDE with bandwidth matrix $H = h^2 I_d$ (scalar bandwidth):
$$\text{AMISE}(\hat{f}_H) = \frac{h^4}{4} [\mu_2(K)]^2 R(\nabla^2 f) + \frac{R(K)}{nh^d}$$
where $\mu_2(K) = \int u_1^2 K(\mathbf{u}) \, d\mathbf{u}$ is the kernel's second moment, $R(K) = \int K(\mathbf{u})^2 \, d\mathbf{u}$ is the kernel's roughness, and $R(\nabla^2 f)$ is the curvature functional defined above.
Optimal Bandwidth: minimizing the AMISE over $h$ gives $$h^* = \left(\frac{d \cdot R(K)}{n \, \mu_2(K)^2 \, R(\nabla^2 f)}\right)^{1/(d+4)}$$
Order: $$h^* = O(n^{-1/(d+4)})$$
Optimal AMISE: $$\text{AMISE}^* = O(n^{-4/(d+4)})$$
The Curse in Numbers:
Let's compute the convergence rate for various dimensions:
| Dimension d | Exponent 4/(d+4) | Rate |
|---|---|---|
| 1 | 4/5 = 0.80 | n⁻⁰·⁸⁰ |
| 2 | 4/6 ≈ 0.67 | n⁻⁰·⁶⁷ |
| 3 | 4/7 ≈ 0.57 | n⁻⁰·⁵⁷ |
| 4 | 4/8 = 0.50 | n⁻⁰·⁵⁰ |
| 5 | 4/9 ≈ 0.44 | n⁻⁰·⁴⁴ |
| 10 | 4/14 ≈ 0.29 | n⁻⁰·²⁹ |
| 20 | 4/24 ≈ 0.17 | n⁻⁰·¹⁷ |
| 50 | 4/54 ≈ 0.07 | n⁻⁰·⁰⁷ |
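The exponents in the table above follow directly from the formula and can be checked in a couple of lines:

```python
# Convergence-rate exponent 4/(d+4) for the MISE of multivariate KDE
for d in (1, 2, 3, 4, 5, 10, 20, 50):
    print(f"d={d:3d}: MISE ~ n^(-{4 / (d + 4):.2f})")
```

Even doubling the sample size in 50 dimensions improves the error by a factor of only $2^{0.07} \approx 1.05$.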
Sample Size Requirements:
For fixed MISE error $\epsilon$, we need $n \propto \epsilon^{-(d+4)/4}$ samples. If we want MISE = 0.01:
| Dimension | Samples Needed (Order of Magnitude) |
|---|---|
| 1 | ~10³ |
| 2 | ~10⁵ |
| 3 | ~10⁶ |
| 5 | ~10⁸ |
| 10 | ~10¹⁴ |
| 20 | ~10²⁴ |
By dimension 10 we need on the order of $10^{14}$ samples, far beyond any realistic dataset; by dimension 20, roughly $10^{24}$, about Avogadro's number. This is the curse of dimensionality: not a polynomial slowdown but an exponential explosion.
For d > 5-6, nonparametric density estimation is effectively impossible without structural assumptions. This is not a limitation of current algorithms—it's a fundamental mathematical fact. High-dimensional spaces are simply too empty for direct estimation without astronomically large datasets.
The curse of dimensionality goes beyond sample requirements. Several geometric and probabilistic phenomena underlie it.
Phenomenon 1: Volume Concentration
Consider a $d$-dimensional unit hypercube $[0,1]^d$. What fraction of volume lies within distance $\epsilon$ of the center?
The inner hypercube of radius $r = 0.5 - \epsilon$ has volume $(2r)^d = (1 - 2\epsilon)^d$.
For $\epsilon = 0.05$ (staying 5% away from the boundary), the inner fraction $(0.9)^d$ is $0.81$ for $d=2$, about $0.35$ for $d=10$, about $0.005$ for $d=50$, and under $10^{-4}$ for $d=100$.
Implication: In high dimensions, almost all volume is near the boundary. Data points uniformly distributed in a hypercube are mostly on the surface.
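The volume calculation above is easy to reproduce; a quick sketch:

```python
# Fraction of the unit hypercube's volume in the inner cube that stays
# eps away from every face: (1 - 2*eps)^d
eps = 0.05
for d in (1, 2, 10, 50, 100):
    print(f"d={d:4d}: inner fraction = {(1 - 2 * eps) ** d:.2e}")
```

By $d = 100$, a shell only 5% thick contains more than 99.99% of the cube's volume.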
Phenomenon 2: Distance Concentration
For uniformly distributed points in a hypercube, the ratio of maximum to minimum pairwise distances converges to 1 as $d \to \infty$: $$\frac{\max_{i,j} \|\mathbf{x}_i - \mathbf{x}_j\|}{\min_{i,j} \|\mathbf{x}_i - \mathbf{x}_j\|} \xrightarrow{d \to \infty} 1$$
All points become approximately equidistant! This makes distance-based methods (k-NN, clusters based on distance) unreliable.
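Distance concentration is simple to observe empirically; a small simulation sketch (sample sizes and dimensions are illustrative):

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
ratios = {}
for d in (2, 10, 100, 1000):
    pts = rng.random((100, d))          # 100 uniform points in [0,1]^d
    dists = pdist(pts)                  # all pairwise Euclidean distances
    ratios[d] = dists.max() / dists.min()
    print(f"d={d:5d}: max/min distance ratio = {ratios[d]:.2f}")
```

The ratio shrinks rapidly toward 1 as the dimension grows, even though the number of points is fixed.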
Phenomenon 3: Sparse Sampling
To uniformly cover a $d$-dimensional unit cube with spacing $\epsilon$ in each dimension requires: $$N = (1/\epsilon)^d$$
grid points. For $\epsilon = 0.1$ (10 points per dimension), this is $10^2 = 100$ points in 2D, but $10^{10}$ in 10D and $10^{20}$ in 20D.
Real datasets rarely have this coverage. Most of the space is unexplored.
Phenomenon 4: Gaussian Concentration
For a $d$-dimensional Gaussian $N(\mathbf{0}, I_d)$, the squared norm $\|\mathbf{X}\|^2 = \sum_{j=1}^d X_j^2$ has mean $d$ and standard deviation $\sqrt{2d}$, so $\|\mathbf{X}\|$ concentrates near $\sqrt{d}$ with fluctuations of order one.
Thus: $$\frac{\|\mathbf{X}\| - \sqrt{d}}{\sqrt{d}} \xrightarrow{d \to \infty} 0$$
Almost all probability mass concentrates on a thin spherical shell of radius $\sqrt{d}$!
In 1000 dimensions, samples from a Gaussian are essentially all at distance ~31.6 from the origin, in a thin shell. The "center" of the distribution contains almost no probability mass.
Implication for KDE: The kernel placed at a sample point covers a tiny fraction of the space where future samples might appear. To have reasonable coverage, we'd need dense sampling of the shell—requiring exponentially many points.
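The shell concentration is easy to verify by simulation; a quick sketch for $d = 1000$:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 1000
X = rng.standard_normal((10000, d))     # 10,000 samples from N(0, I_d)
norms = np.linalg.norm(X, axis=1)

print(f"sqrt(d)    = {np.sqrt(d):.2f}")  # the shell radius, ~31.62
print(f"mean ||X|| = {norms.mean():.2f}")
print(f"std  ||X|| = {norms.std():.2f}")  # O(1), independent of d
```

The standard deviation of $\|\mathbf{X}\|$ stays of order one while the radius grows like $\sqrt{d}$, so relative fluctuations vanish.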
In 2D/3D, we can visualize densities, identify clusters, see modes. In 100D, these visual intuitions have no mapping. Points are isolated. Everything is equidistant. The bulk of probability is in weird geometric locations (shells, corners). Nonparametric methods, which rely on local averaging, cannot work without structural assumptions.
Given the curse of dimensionality, what can be done? The key is imposing structure—reducing the effective dimensionality of the problem.
Strategy 1: Dimensionality Reduction
Project data to lower dimensions, then estimate density there.
Linear Methods: PCA (project onto the leading principal components), random projections, and supervised directions such as LDA.
Nonlinear Methods: kernel PCA, autoencoders, and manifold learners such as t-SNE or UMAP (useful for visualization, though their distortions complicate density interpretation).
Adjusting for Projection:
If $\mathbf{z} = W^T \mathbf{x}$ where $W$ is a $d \times k$ projection, then: $$f_{\mathbf{z}}(\mathbf{z}) = \int f_{\mathbf{x}}(\mathbf{x}) \, d\mathbf{x}_{\perp}$$
where the integral is over the orthogonal complement. The projected density $f_{\mathbf{z}}$ can be estimated with KDE, but it's a marginal, not the full joint.
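A sketch of the project-then-estimate strategy, using PCA via SVD followed by a 2D KDE. The synthetic 10-D data (with variance concentrated in three directions) is illustrative:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# 10-D data whose variation is concentrated in a few directions
data = rng.standard_normal((500, 10)) * np.array([3, 2, 1] + [0.1] * 7)

# Project onto the top-k principal components (PCA via SVD)
X = data - data.mean(axis=0)
_, _, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
Z = X @ Vt[:k].T                      # (n, k) projected coordinates

# KDE on the k-dimensional marginal, not the full 10-D joint
kde = stats.gaussian_kde(Z.T)
print(kde(np.zeros((k, 1))))          # density of the projection at the origin
```

Note that the result is the marginal density of the projection, per the integral above, not the full joint density.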
Strategy 2: Structural Assumptions
Independence/Factorization: Assume $f(\mathbf{x}) = \prod_{j=1}^d f_j(x_j)$—full independence. Estimate each marginal separately.
Conditional Independence (Graphical Models): $$f(\mathbf{x}) \propto \prod_{C \in \mathcal{C}} \psi_C(\mathbf{x}_C)$$
The density factors according to a graph structure, with potential functions over cliques.
Copula Models: Separate marginal distributions from dependence structure: $$f(\mathbf{x}) = c(F_1(x_1), \ldots, F_d(x_d)) \cdot \prod_{j=1}^d f_j(x_j)$$
where $c$ is the copula density on $[0,1]^d$. Often parametric families (Gaussian, Clayton, Frank) suffice for $c$.
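A sketch of the copula idea: strip away the marginals with ranks, then fit a parametric (here Gaussian) copula to the dependence via the correlation of normal scores. The synthetic data and transforms are illustrative:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Correlated Gaussian latents pushed through monotone, non-Gaussian marginals
z = rng.multivariate_normal([0, 0], [[1, 0.7], [0.7, 1]], size=2000)
x = np.column_stack([np.exp(z[:, 0]), z[:, 1] ** 3])

# Step 1: map each margin to (0,1) via its empirical CDF (ranks)
n, d = x.shape
u = stats.rankdata(x, axis=0) / (n + 1)

# Step 2: fit a Gaussian copula -- the correlation of the normal scores
scores = stats.norm.ppf(u)
R = np.corrcoef(scores, rowvar=False)
print(R)   # recovers the latent dependence (~0.7) despite distorted marginals
```

Because ranks are invariant to monotone transformations of each margin, the estimated copula correlation recovers the underlying dependence even though neither margin is Gaussian.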
Strategy 3: Mixture Models
$$f(\mathbf{x}) = \sum_{k=1}^K \pi_k \cdot \phi(\mathbf{x}; \boldsymbol{\mu}_k, \Sigma_k)$$
Gaussian Mixture Models (GMMs) are parametric but flexible. With enough components, they can approximate any density. Covered in detail in Module 3 of this chapter.
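As a quick sketch of the mixture strategy, scikit-learn's `GaussianMixture` fits the model above by EM; the two-cluster toy data is illustrative:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two well-separated Gaussian clusters
data = np.vstack([
    rng.multivariate_normal([0, 0], np.eye(2), 300),
    rng.multivariate_normal([5, 5], np.eye(2), 300),
])

gmm = GaussianMixture(n_components=2, random_state=0).fit(data)
print(gmm.weights_)                            # mixing proportions, ~[0.5, 0.5]
print(np.exp(gmm.score_samples([[0.0, 0.0]]))) # density estimate near one mode
```

`score_samples` returns $\log \hat{f}(\mathbf{x})$, so exponentiating gives the density; only $K$ means, covariances, and weights are estimated, regardless of $n$.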
Strategy 4: Density Ratio Estimation
Rather than estimating $f(\mathbf{x})$ directly, estimate the ratio: $$r(\mathbf{x}) = \frac{f(\mathbf{x})}{g(\mathbf{x})}$$
where $g$ is a known reference distribution (often Gaussian fitted to data).
Methods like Kernel Mean Matching and Least-Squares Importance Fitting (LSIF) estimate $r(\mathbf{x})$ directly without going through density estimation—often easier in high dimensions.
Strategy 5: Neural Density Estimators
Normalizing Flows: $$f(\mathbf{x}) = \pi_0(T^{-1}(\mathbf{x})) \cdot \left|\det \nabla T^{-1}(\mathbf{x})\right|$$
where $\pi_0$ is a simple base distribution and $T$ is a learned invertible neural network.
Variational Autoencoders (VAEs): Approximate the density through a latent variable model with learned encoder and decoder networks.
Autoregressive Models: $$f(\mathbf{x}) = \prod_{j=1}^d f(x_j | x_1, \ldots, x_{j-1})$$
Each conditional is modeled by a neural network. Examples: MADE, PixelCNN, WaveNet.
For d = 4-6: Multivariate KDE may still work with large n. Use diagonal or full H with reference rules. For d = 7-20: Consider GMMs, reduced-dimension KDE, or copula models. For d > 20: Neural methods (normalizing flows, VAEs) or strong structural assumptions (factorization, graphical models) are usually necessary.
Selecting the bandwidth matrix $H$ in multiple dimensions is more complex than the univariate case.
For Scalar Bandwidth $H = h^2 I$:
Extend univariate methods: reference rules, least-squares cross-validation, and plug-in selectors all carry over, now optimizing a single $h$ against the $d$-dimensional AMISE.
LSCV for Multivariate:
$$\text{LSCV}(H) = \int \hat{f}_H^2(\mathbf{x}) \, d\mathbf{x} - \frac{2}{n} \sum_{i=1}^n \hat{f}_{H,-i}(\mathbf{x}_i)$$
For the Gaussian kernel, the integral has a closed form: $$\int \hat{f}_H^2(\mathbf{x}) \, d\mathbf{x} = \frac{1}{n^2 (2\pi)^{d/2} |2H|^{1/2}} \sum_{i=1}^n \sum_{j=1}^n \exp\left(-\frac{1}{4}(\mathbf{x}_i - \mathbf{x}_j)^T H^{-1} (\mathbf{x}_i - \mathbf{x}_j)\right)$$
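Combining the closed-form integral with the leave-one-out term gives a directly computable LSCV score for a scalar bandwidth $H = h^2 I$. A sketch with a simple grid search (the function name, grid, and data are illustrative):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def lscv(h, data):
    """LSCV score for scalar bandwidth H = h^2 I with a Gaussian kernel."""
    n, d = data.shape
    D2 = squareform(pdist(data, 'sqeuclidean'))     # ||x_i - x_j||^2 matrix
    # Closed-form integral of fhat^2: note |2H|^{1/2}(2pi)^{d/2} = (4 pi h^2)^{d/2}
    integral = np.exp(-D2 / (4 * h**2)).sum() / (n**2 * (4 * np.pi * h**2) ** (d / 2))
    # Leave-one-out average (1/n) sum_i fhat_{-i}(x_i): drop the diagonal terms
    K = np.exp(-D2 / (2 * h**2))
    loo = (K.sum() - n) / ((n - 1) * n * (2 * np.pi * h**2) ** (d / 2))
    return integral - 2 * loo

rng = np.random.default_rng(0)
data = rng.standard_normal((200, 2))
hs = np.linspace(0.1, 1.0, 30)
scores = [lscv(h, data) for h in hs]
h_opt = hs[int(np.argmin(scores))]
print(f"LSCV-optimal h = {h_opt:.3f}")
```

In practice one would use a proper optimizer rather than a grid, but the grid makes the shape of the LSCV curve easy to inspect.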
For Diagonal $H$:
Optimize $d$ bandwidth parameters $(h_1, \ldots, h_d)$, typically by numerical minimization of the multivariate LSCV criterion, or by applying a univariate selector coordinate-wise as a starting point.
For Full $H$:
The $(d(d+1)/2)$ parameters are typically too many for reliable optimization with moderate $n$. Practical approach:
Pre-Whitening (Transformation Approach):
1. Transform the data: $\mathbf{z}_i = \hat{\Sigma}^{-1/2} \mathbf{x}_i$, so the transformed sample has identity covariance.
2. Apply KDE with a single scalar bandwidth $h$ to the $\mathbf{z}_i$.
3. Transform the estimate back, multiplying by the Jacobian $|\hat{\Sigma}|^{-1/2}$.
This is equivalent to using $H = h^2 \hat{\Sigma}$.
For routine work: pre-whiten data using sample covariance, apply univariate bandwidth selection (Silverman or Sheather-Jones) to the transformed data, and use that scalar h. This combines the adaptability of univariate methods with appropriate scaling for multivariate correlation structure.
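The whitening step can be sketched with a Cholesky factor of the sample covariance; the Scott-style scalar factor and function name are illustrative:

```python
import numpy as np

def prewhiten(data):
    """Whiten with the sample covariance; equivalent to H = h^2 * Sigma_hat.

    Returns the whitened data and a Scott-style scalar bandwidth for it.
    """
    Sigma = np.cov(data, rowvar=False)
    L = np.linalg.cholesky(Sigma)                # Sigma = L L^T
    centered = data - data.mean(axis=0)
    white = np.linalg.solve(L, centered.T).T     # z_i = L^{-1} (x_i - xbar)
    n, d = white.shape
    h = n ** (-1 / (d + 4))                      # Scott's factor, unit variance
    return white, h

rng = np.random.default_rng(0)
data = rng.multivariate_normal([0, 0], [[2, 1.5], [1.5, 2]], size=500)
white, h = prewhiten(data)
print(np.cov(white, rowvar=False))   # identity, by construction
print(h)
```

Because the same sample covariance is used for whitening, the transformed data has exactly identity sample covariance, so a single scalar $h$ is appropriate.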
Practical multivariate KDE involves computational and visualization challenges beyond the univariate case.
```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt


def multivariate_kde_2d(data, bandwidth='scott', n_grid=100):
    """
    2D kernel density estimation using scipy.

    Parameters
    ----------
    data : array-like, shape (n, 2)
        2D data points.
    bandwidth : str or float
        'scott', 'silverman', or a numeric scalar factor.
    n_grid : int
        Grid resolution for evaluation.

    Returns
    -------
    xx, yy : ndarray
        Meshgrid coordinates.
    density : ndarray
        Estimated density on the grid.
    """
    data = np.asarray(data)

    # Create KDE object (scipy expects shape (d, n))
    kde = stats.gaussian_kde(data.T, bw_method=bandwidth)

    # Create evaluation grid
    x_min, x_max = data[:, 0].min() - 1, data[:, 0].max() + 1
    y_min, y_max = data[:, 1].min() - 1, data[:, 1].max() + 1
    xx, yy = np.meshgrid(
        np.linspace(x_min, x_max, n_grid),
        np.linspace(y_min, y_max, n_grid)
    )

    # Evaluate density on the grid
    positions = np.vstack([xx.ravel(), yy.ravel()])
    density = kde(positions).reshape(xx.shape)

    return xx, yy, density


def plot_2d_kde(data, title="2D KDE", bandwidth='scott'):
    """Visualize a 2D KDE with contours and a scatter overlay."""
    xx, yy, density = multivariate_kde_2d(data, bandwidth)

    fig, ax = plt.subplots(figsize=(8, 6))

    # Filled contours
    cf = ax.contourf(xx, yy, density, levels=20, cmap='viridis', alpha=0.8)
    plt.colorbar(cf, label='Density')

    # Contour lines
    ax.contour(xx, yy, density, levels=8, colors='white', linewidths=0.5)

    # Data points
    ax.scatter(data[:, 0], data[:, 1], s=5, c='white', alpha=0.5,
               edgecolors='none')

    ax.set_xlabel('X')
    ax.set_ylabel('Y')
    ax.set_title(title)
    return fig, ax


# Example: bimodal 2D distribution
np.random.seed(42)
n_samples = 500

# Two clusters with different covariances
cluster1 = np.random.multivariate_normal([0, 0], [[1, 0.5], [0.5, 1]],
                                         n_samples // 2)
cluster2 = np.random.multivariate_normal([4, 3], [[0.5, -0.3], [-0.3, 0.5]],
                                         n_samples // 2)
data_2d = np.vstack([cluster1, cluster2])

# Compute KDE with the default (Scott) bandwidth
kde = stats.gaussian_kde(data_2d.T)
print(f"Scott bandwidth factor: {kde.factor:.4f}")
print(f"Covariance matrix:\n{kde.covariance}")

# Different bandwidth effects
for bw in ['scott', 'silverman', 0.1, 0.5]:
    kde_temp = stats.gaussian_kde(data_2d.T, bw_method=bw)
    print(f"Bandwidth '{bw}': factor = {kde_temp.factor:.4f}")
```

Visualization Strategies:
2D Density: filled contours, contour lines overlaid on a scatter plot, or heatmaps.
3D Density: isosurfaces at chosen density levels, or 2D slices at fixed values of the third variable.
Higher Dimensions: direct visualization is impossible; inspect matrices of pairwise 2D marginal KDEs, 1D marginals, or projections onto principal components.
Computational Efficiency:
For large $n$ and/or high $d$, naive evaluation requires $O(nm)$ kernel evaluations for $m$ query points. Tree-based approximations (KD-trees, ball trees), grid binning with FFT convolution, and subsampling can reduce this cost by orders of magnitude.
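As one concrete option, scikit-learn's `KernelDensity` supports tree-accelerated evaluation; the tolerance `rtol` trades a small, controlled approximation error for speed. The bandwidth and data here are illustrative:

```python
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
data = rng.standard_normal((10000, 3))

# Tree-accelerated KDE: a kd-tree prunes far-away points, and rtol
# allows a small relative error in exchange for large speedups
kde = KernelDensity(kernel='gaussian', bandwidth=0.3,
                    algorithm='kd_tree', rtol=1e-4)
kde.fit(data)

log_dens = kde.score_samples(data[:5])   # log-density at the first 5 points
print(np.exp(log_dens))
```

Setting `rtol=0` recovers the exact (slower) computation.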
We've completed our comprehensive exploration of density estimation fundamentals, from parametric vs nonparametric philosophy to the challenges of high-dimensional estimation. Let's consolidate the key insights from this page and the module as a whole:
Module Summary: Density Estimation Fundamentals
Across this module, we've developed a complete framework for understanding and applying density estimation:
Parametric vs Nonparametric: The fundamental choice between assuming a distributional form vs letting data speak for itself.
Histogram Estimation: The simplest nonparametric estimator, with convergence rate $O(n^{-2/3})$ and limitations in continuity and bin sensitivity.
Kernel Density Estimation: The workhorse nonparametric method, achieving $O(n^{-4/5})$ univariate and $O(n^{-4/(d+4)})$ multivariate rates.
Bandwidth Selection: From reference rules through plug-in methods to cross-validation—the critical practical choice.
Multivariate Extension: Where the curse of dimensionality confronts us with fundamental limits.
This foundation prepares you for the subsequent modules on Kernel Density Estimation theory, Gaussian Mixture Models, the EM algorithm, and density-based clustering methods.
You have now mastered the fundamentals of density estimation—from the philosophical distinction between parametric and nonparametric approaches through the mathematical theory of histograms and KDE, to the practical challenges of bandwidth selection and multivariate extension. This foundation underpins all the probabilistic machine learning methods we'll explore in subsequent modules.