The histogram is perhaps the most recognizable statistical graphic in existence. From introductory statistics courses to high-dimensional data exploration, histograms serve as our primary window into the distributional properties of data. Yet beneath this familiar visualization lies a rigorous framework for nonparametric density estimation.
Introduced by Karl Pearson in 1895 (the term "histogram" itself was coined in an 1891 lecture), the histogram remains ubiquitous for good reason: it is simple to compute, easy to interpret, and often sufficient for exploratory analysis. But histograms are more than just bar charts—they are consistent density estimators with well-understood mathematical properties.
In this page, we'll move far beyond the superficial understanding of histograms. We'll explore their mathematical foundations, analyze their statistical properties, understand their limitations, and learn principled approaches to one of the most vexing questions in practical statistics: How do we choose the bin width?
By the end of this page, you will understand histograms as formal density estimators, derive their bias and variance properties, learn multiple bin width selection rules (including Sturges, Scott, and Freedman-Diaconis), and recognize histograms as a stepping stone to more sophisticated methods like kernel density estimation.
Let's begin with a precise definition. Given data $\mathcal{D} = \{x_1, x_2, \ldots, x_n\}$, a histogram partitions the data range into bins and estimates density based on the proportion of points falling into each bin.
Formal Definition:
Choose bin boundaries $b_0 < b_1 < \cdots < b_m$ such that $b_0 \leq \min_i x_i$ and $b_m \geq \max_i x_i$. Define bins $B_j = [b_{j-1}, b_j)$ for $j = 1, \ldots, m$ (with the last bin typically closed on both ends).
Let $n_j$ denote the count of observations in bin $B_j$: $$n_j = \sum_{i=1}^{n} \mathbf{1}(x_i \in B_j)$$
where $\mathbf{1}(\cdot)$ is the indicator function.
The histogram density estimator at point $x$ is: $$\hat{f}_h(x) = \frac{n_j}{n \cdot h_j} \quad \text{for } x \in B_j$$
where $h_j = b_j - b_{j-1}$ is the width of bin $B_j$.
For equal-width bins with common width $h = h_j$ for all $j$: $$\hat{f}_h(x) = \frac{n_j}{nh} \quad \text{for } x \in B_j$$
A critical but often overlooked point: a histogram used for density estimation must divide counts by the bin width, not just by $n$. This ensures the histogram integrates to 1 and is therefore a valid probability density. The frequency histogram ($n_j/n$) gives probability mass per bin, not density; density is mass per unit width.
Verification That Histogram Is a Valid Density:
$$\int_{-\infty}^{\infty} \hat{f}_h(x) \, dx = \sum_{j=1}^{m} \frac{n_j}{nh_j} \cdot h_j = \sum_{j=1}^{m} \frac{n_j}{n} = \frac{\sum_{j=1}^{m} n_j}{n} = \frac{n}{n} = 1$$
Also, $\hat{f}_h(x) \geq 0$ everywhere, so the histogram is indeed a valid probability density function.
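This normalization is easy to check numerically. A small sketch using `np.histogram` (the sample and the bin count of 20 are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=1000)

# Frequency histogram: counts / n gives probability mass per bin
counts, edges = np.histogram(data, bins=20)
widths = np.diff(edges)
mass = counts / len(data)

# Density histogram: counts / (n * h) gives density, which integrates to 1
density = counts / (len(data) * widths)

print(mass.sum())                # total probability mass: 1.0
print(np.sum(density * widths))  # integral of the density: 1.0
# NumPy's density=True applies exactly this normalization
print(np.allclose(density, np.histogram(data, bins=20, density=True)[0]))
```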
Two Critical Choices:
The bin width is by far the more important choice—it determines the fundamental properties of the estimate. The bin origin has minor effects but can shift features slightly. We'll focus primarily on bin width selection.
| Component | Notation | Role | Key Consideration |
|---|---|---|---|
| Bin boundaries | b₀, b₁, ..., bₘ | Partition the data range | Typically evenly spaced |
| Bin width | h = bⱼ - bⱼ₋₁ | Width of each interval | Main smoothing parameter |
| Bin count | nⱼ | Points in bin j | Random variable |
| Number of bins | m | Resolution of partition | m ≈ range/h |
| Density estimate | f̂(x) = nⱼ/(nh) | Estimated density at x | Piecewise constant |
Understanding the statistical properties of the histogram estimator requires analyzing its bias, variance, and mean squared error. These calculations reveal how bin width controls the fundamental tradeoff between smoothness and fidelity to data.
Expected Value of Bin Count:
For a point $x$ in bin $B_j$, the bin count $n_j$ follows a binomial distribution: $$n_j \sim \text{Binomial}(n, p_j)$$
where $p_j = \int_{B_j} f(t) \, dt = \mathbb{P}(X \in B_j)$ is the probability of landing in bin $B_j$.
Thus: $$\mathbb{E}[n_j] = n \cdot p_j \approx n \cdot h \cdot f(x)$$
where the approximation uses $p_j \approx h \cdot f(x)$ for small bin width.
Bias Analysis:
The expected value of the histogram estimator at a point $x$ in bin $B_j$ is: $$\mathbb{E}[\hat{f}_h(x)] = \mathbb{E}\left[\frac{n_j}{nh}\right] = \frac{p_j}{h}$$
Using Taylor expansion, if $f$ is twice differentiable, the probability of falling in bin $B_j = [b_{j-1}, b_j)$ is: $$p_j = \int_{b_{j-1}}^{b_j} f(t) \, dt = h \cdot f(c_j) + \frac{h^3}{24} f''(c_j) + O(h^5)$$
where $c_j = (b_{j-1} + b_j)/2$ is the bin center.
Therefore: $$\mathbb{E}[\hat{f}_h(x)] = f(c_j) + \frac{h^2}{24} f''(c_j) + O(h^4)$$
The bias at point $x$ (which may not equal the bin center) involves two components:

- a first-order term $\approx f'(c_j)(c_j - x)$ of order $O(h)$, coming from the offset of $x$ within its bin;
- a curvature term of order $O(h^2)$, coming from averaging $f$ over the bin.
The integrated squared bias is: $$\int \text{Bias}^2[\hat{f}_h(x)] \, dx = O(h^2)$$
Bias decreases as $h \to 0$ (smaller bins, less averaging).
Variance Analysis:
Since $n_j \sim \text{Binomial}(n, p_j)$, we have $\text{Var}(n_j) = np_j(1-p_j)$.
For small bin width, $p_j \approx hf(x) \ll 1$, so: $$\text{Var}(n_j) \approx np_j \approx nhf(x)$$
The variance of the histogram estimator is: $$\text{Var}[\hat{f}_h(x)] = \text{Var}\left[\frac{n_j}{nh}\right] = \frac{\text{Var}(n_j)}{n^2h^2} \approx \frac{nhf(x)}{n^2h^2} = \frac{f(x)}{nh}$$
The integrated variance is: $$\int \text{Var}[\hat{f}_h(x)] \, dx \approx \frac{1}{nh}$$
Variance decreases as $h$ increases (larger bins, more points per bin) or as $n$ increases.
Note the conflicting behavior: Bias decreases with smaller h (wanting h → 0), but Variance decreases with larger h (wanting h → ∞). This is the bias-variance tradeoff in action. The optimal h must balance these opposing forces.
Mean Integrated Squared Error (MISE):
Combining bias and variance: $$\text{MISE}(\hat{f}_h) = \int \mathbb{E}[(\hat{f}_h(x) - f(x))^2] \, dx = \int \text{Bias}^2 \, dx + \int \text{Var} \, dx$$
For the histogram: $$\text{MISE}(\hat{f}_h) \approx \frac{h^2}{12} \int [f'(x)]^2 \, dx + \frac{1}{nh}$$
(The bias term comes from discretizing $f$ into piecewise-constant bins; the $h^2/12$ coefficient is the standard asymptotic result when $f'$ is square-integrable.)
Optimal Bin Width:
Minimizing MISE with respect to $h$: $$\frac{d}{dh} \text{MISE} = \frac{h}{6} R(f') - \frac{1}{nh^2} = 0$$
where $R(f') = \int [f'(x)]^2 \, dx$ is the roughness of $f'$.
Solving: $$h^* = \left(\frac{6}{n \cdot R(f')}\right)^{1/3}$$
The optimal MISE scales as: $$\text{MISE}^* = O(n^{-2/3})$$
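The U-shape of the MISE, and the location of its minimum, can be checked by simulation. A sketch assuming a standard normal target; the grid range, replication count, and candidate widths are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(6)
grid = np.linspace(-4, 4, 801)
dg = grid[1] - grid[0]
true_f = np.exp(-grid**2 / 2) / np.sqrt(2 * np.pi)  # N(0,1) density

def mise_estimate(h, n=500, reps=200):
    """Monte Carlo estimate of the histogram MISE for N(0,1) samples."""
    ise = 0.0
    for _ in range(reps):
        data = rng.standard_normal(n)
        edges = np.arange(-4.0, 4.0 + h, h)
        counts, _ = np.histogram(data, bins=edges)
        density = counts / (n * h)
        # Evaluate the piecewise-constant estimate on the grid
        idx = np.clip(np.searchsorted(edges, grid, side='right') - 1,
                      0, len(counts) - 1)
        ise += np.sum((density[idx] - true_f) ** 2) * dg
    return ise / reps

# Too-fine bins (variance) and too-coarse bins (bias) both lose;
# 0.44 is roughly 3.49 * 500**(-1/3), Scott's choice for this n
for h in [0.05, 0.44, 2.0]:
    print(f"h = {h:4.2f}: estimated MISE = {mise_estimate(h):.4f}")
```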
Compare this to KDE's $O(n^{-4/5})$ rate—histograms converge slower due to their piecewise-constant nature.
| Property | Formula | Behavior | Implication |
|---|---|---|---|
| Integrated Bias² | O(h²) | Decreases with smaller h | Fine bins reduce approximation error |
| Integrated Variance | O(1/nh) | Decreases with larger h | Coarse bins reduce sampling error |
| Optimal h | O(n⁻¹/³) | Shrinks with more data | More data → finer resolution |
| Optimal MISE | O(n⁻²/³) | Slower than KDE (n⁻⁴/⁵) | Histograms less efficient |
| Convergence rate | n⁻²/³ | Sub-optimal for smooth densities | Piecewise constant limits accuracy |
The theoretical optimal bin width depends on $R(f')$, which involves the unknown density $f$—a classic chicken-and-egg problem. Practical bin width selection rules make assumptions or use data-driven approaches to estimate reasonable values.
1. Sturges' Rule (1926):
One of the oldest rules, based on the assumption that data follow a normal distribution: $$m = 1 + \log_2(n) = 1 + 3.322 \cdot \log_{10}(n)$$ $$h = \frac{\text{range}}{m} = \frac{\max(x) - \min(x)}{1 + \log_2(n)}$$
Derivation: Sturges assumed the data come from a normal distribution and that the bin counts should follow a binomial distribution approximating the normal. Under this idealized scenario, $1 + \log_2(n)$ bins suffice.
Limitations:

- Assumes approximately normal, unimodal data.
- The bin count grows only logarithmically in $n$, so large samples get far too few bins (oversmoothing).
- Has no MISE-based optimality justification; it predates the modern theory.
2. Scott's Rule (1979):
Derived by minimizing MISE assuming a normal reference distribution: $$h = 3.49 \cdot \hat{\sigma} \cdot n^{-1/3}$$
where $\hat{\sigma}$ is the sample standard deviation.
Derivation: For a normal density $N(\mu, \sigma^2)$, we have $R(f') = \frac{1}{4\sqrt{\pi}\sigma^3}$. Substituting into the optimal bin width formula gives: $$h^* = \left(\frac{24\sqrt{\pi}}{n}\right)^{1/3} \sigma \approx 3.49 \cdot \sigma \cdot n^{-1/3}$$
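The constant can be verified numerically: compute $R(f')$ for a standard normal by quadrature and plug it into $h^* = (6/(nR(f')))^{1/3}$. A sketch (the grid resolution and $n = 1000$ are arbitrary):

```python
import numpy as np

sigma, n = 1.0, 1000
x = np.linspace(-10, 10, 200_001)
dx = x[1] - x[0]

# Derivative of the N(0, sigma^2) density
f = np.exp(-x**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))
fprime = -x / sigma**2 * f

# Roughness R(f') agrees with the closed form 1 / (4 sqrt(pi) sigma^3)
R_numeric = np.sum(fprime**2) * dx
print(R_numeric, 1 / (4 * np.sqrt(np.pi) * sigma**3))

# Plugging into h* = (6 / (n R))^(1/3) recovers Scott's constant 3.49
h_opt = (6 / (n * R_numeric)) ** (1 / 3)
print(h_opt, 3.49 * sigma * n ** (-1 / 3))
```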
Advantages over Sturges:

- Grounded in MISE minimization rather than an ad hoc bin count.
- Scales with the spread of the data through $\hat{\sigma}$.
- The bin count grows as $n^{1/3}$, appropriate for large samples.
Limitations:

- $\hat{\sigma}$ is inflated by outliers, producing overly wide bins.
- Oversmooths skewed or multimodal densities relative to their optimal width.
3. Freedman-Diaconis Rule (1981):
A robust alternative using the interquartile range (IQR): $$h = 2 \cdot \text{IQR} \cdot n^{-1/3}$$
where $\text{IQR} = Q_3 - Q_1$ (75th percentile minus 25th percentile).
Rationale: For a normal distribution, $\text{IQR} \approx 1.349\sigma$, so this gives $h \approx 2.7\,\hat{\sigma}\, n^{-1/3}$, slightly narrower bins than Scott's rule. Unlike $\hat{\sigma}$, however, the IQR is robust to outliers: it is unaffected until more than 25% of the data are contaminated.
Advantages:

- Robust to outliers and heavy tails.
- Same $n^{-1/3}$ scaling as the MISE-optimal width.
- A sound default for real, messy data.
Limitations:

- Can undersmooth heavy-tailed data (a small IQR relative to the full range produces many bins).
- Fails when $\text{IQR} = 0$, as can happen with heavily discretized data.
| Rule | Formula for h | Assumptions | Best For | Weaknesses |
|---|---|---|---|---|
| Sturges (1926) | range / (1 + log₂n) | Normal data | Small, normal samples | Very few bins for large n |
| Scott (1979) | 3.49 σ̂ n⁻¹/³ | Normal reference | Unimodal symmetric | Sensitive to outliers |
| Freedman-Diaconis | 2 · IQR · n⁻¹/³ | Robustness needed | Data with outliers | Can undersmooth heavy tails |
| Doane (1976) | Sturges + skewness correction | Skewed normal | Moderately skewed | Limited for multimodality |
| sqrt(n) | range / √n | None (rule of thumb) | Quick exploration | No theoretical basis |
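NumPy implements several of these rules directly through `np.histogram_bin_edges`, which makes it easy to compare them on the same sample:

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.standard_normal(5000)

# NumPy accepts the rule name as the `bins` argument
for rule in ['sturges', 'scott', 'fd', 'sqrt']:
    edges = np.histogram_bin_edges(data, bins=rule)
    print(f"{rule:8s}: {len(edges) - 1:4d} bins, h = {edges[1] - edges[0]:.3f}")
```

Note how Sturges produces an order of magnitude fewer bins than Scott or Freedman-Diaconis at this sample size.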
4. Cross-Validation:
A data-driven approach that doesn't assume a reference distribution:
Leave-One-Out Cross-Validation: $$\text{CV}(h) = \frac{2}{(n-1)h} - \frac{n+1}{(n-1)n^2h} \sum_{j=1}^{m} n_j^2$$
Minimize CV$(h)$ over a grid of $h$ values. This minimizes an unbiased estimate of the integrated squared error, up to an additive constant that does not depend on $h$, without knowing $f$.
Advantages:

- Assumes no reference distribution.
- Directly targets the quantity we care about (the ISE).
Disadvantages:

- Requires evaluating the criterion over a grid of widths, costing more computation.
- The CV curve can be noisy, and its minimizer tends to undersmooth.
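The criterion above can be minimized by direct grid search. A minimal sketch; the helper name `histogram_cv`, its defaults, and the search grid are illustrative choices, not from the text:

```python
import numpy as np

def histogram_cv(data, h):
    """Leave-one-out CV criterion for an equal-width histogram of width h."""
    data = np.asarray(data)
    n = len(data)
    lo = data.min()
    m = int(np.ceil((data.max() - lo) / h)) + 1
    edges = lo + h * np.arange(m + 1)
    counts, _ = np.histogram(data, bins=edges)
    # CV(h) = 2/((n-1)h) - (n+1)/((n-1) n^2 h) * sum(n_j^2)
    return (2 / ((n - 1) * h)
            - (n + 1) / ((n - 1) * n**2 * h) * np.sum(counts**2))

rng = np.random.default_rng(2)
data = rng.standard_normal(500)

# Grid search over candidate bin widths
grid = np.linspace(0.05, 2.0, 100)
scores = [histogram_cv(data, h) for h in grid]
h_best = grid[int(np.argmin(scores))]
print(f"CV-optimal h ≈ {h_best:.3f}")
```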
For exploratory analysis, start with Freedman-Diaconis (robust default), then try multiple bin widths to see how features change. Sturges is only appropriate for small samples from nearly-normal distributions. For publication-quality figures, consider cross-validation or simply choose a width that best reveals the structure you want to communicate.
Despite their utility, histograms have fundamental limitations that motivated the development of more sophisticated density estimators. Understanding these limitations is essential for knowing when to use alternative methods.
Demonstration: Bin Origin Sensitivity
Consider data $\{0.9, 1.1, 2.9, 3.1\}$ representing two clusters near 1 and 3. With bin width $h = 1$ and origin 0, the bins $[0,1), [1,2), [2,3), [3,4]$ each contain exactly one point, so the histogram looks completely flat. Shift the origin to 0.5 and the bins $[0.5,1.5), [1.5,2.5), [2.5,3.5]$ contain 2, 0, and 2 points: two clear clusters separated by an empty bin.
Same data, same bin width, completely different visual impression. The "true" structure (two clusters) is revealed or hidden based on an arbitrary choice.
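This can be reproduced with `np.histogram` using explicit edges (bin width 1, with origins 0 and 0.5 as the illustrative choices):

```python
import numpy as np

data = np.array([0.9, 1.1, 2.9, 3.1])

# Origin 0: edges 0,1,2,3,4 -> every bin holds one point (looks flat)
counts_a, edges_a = np.histogram(data, bins=np.arange(0.0, 4.5, 1.0))
print(counts_a)   # [1 1 1 1]

# Origin 0.5: edges 0.5,1.5,2.5,3.5 -> two clusters with a gap between
counts_b, edges_b = np.histogram(data, bins=np.arange(0.5, 4.0, 1.0))
print(counts_b)   # [2 0 2]
```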
One mitigation is the Averaged Shifted Histogram (ASH), which averages multiple histograms with different origins. This produces a smoother estimate but adds complexity. Kernel density estimation solves these issues more elegantly by using points themselves as centers.
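A minimal ASH sketch, assuming equal-width bins and a fixed number of shifts; the helper name `ash` and its defaults are illustrative:

```python
import numpy as np

def ash(data, h, shifts=8, grid_size=400):
    """Averaged shifted histogram: average `shifts` histograms whose
    origins differ by h/shifts, evaluated on a common fine grid."""
    data = np.asarray(data)
    lo, hi = data.min() - h, data.max() + h
    grid = np.linspace(lo, hi, grid_size)
    est = np.zeros(grid_size)
    for s in range(shifts):
        origin = lo + s * h / shifts
        edges = np.arange(origin, hi + h, h)
        counts, _ = np.histogram(data, bins=edges)
        density = counts / (len(data) * h)
        # Assign each grid point the density of the bin it falls in
        idx = np.searchsorted(edges, grid, side='right') - 1
        inside = (idx >= 0) & (idx < len(counts))
        est[inside] += density[idx[inside]]
    return grid, est / shifts

# The two-cluster example above: the ASH is far less origin-dependent
data = np.array([0.9, 1.1, 2.9, 3.1])
grid, est = ash(data, h=1.0)
print(f"integral ≈ {np.sum(est) * (grid[1] - grid[0]):.3f}")
```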
Discontinuity and Derivative Failure:
For many applications, we need not just the density but also its derivatives. For example:

- Mode finding locates zeros of $f'(x)$.
- Gradient-based procedures such as mean-shift clustering follow $\nabla f$ uphill.
- Identifying inflection points and antimodes requires $f''$.
Histograms fail here entirely—their derivative is zero almost everywhere (within bins) and undefined at bin boundaries. This makes them unsuitable for applications requiring the density's shape characteristics.
Kernel density estimation, covered in the next page, addresses all these limitations by producing smooth, infinitely differentiable estimates (with appropriate kernel choice).
The limitations of fixed-width histograms have inspired extensions that allow bin widths to vary across the data range.
Variable-Width Histograms:
Rather than fixing bin width, we can fix the number of points per bin. The resulting histogram has:

- narrow bins where the data are dense (high resolution where the sample can support it);
- wide bins where the data are sparse (controlling variance in the tails).
Equal-Frequency Binning: $$n_j = \frac{n}{m} \quad \text{for all bins } j$$
The density estimate becomes: $$\hat{f}(x) = \frac{1}{nh_j} \cdot \frac{n}{m} = \frac{1}{mh_j}$$
Narrow $h_j$ → high density estimate; wide $h_j$ → low density estimate.
Advantages:

- Adapts resolution to the local data density.
- Avoids empty bins and noisy, sparsely populated tails.
Disadvantages:

- Bin edges become data-dependent random variables, complicating theoretical analysis.
- Harder to read: unequal bar widths are easily misinterpreted.
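A minimal sketch of equal-frequency binning via quantile edges; the helper name and the exponential toy data are illustrative:

```python
import numpy as np

def equal_frequency_histogram(data, m):
    """Variable-width histogram with roughly n/m points per bin."""
    data = np.asarray(data)
    # Bin edges at evenly spaced quantiles of the data
    edges = np.quantile(data, np.linspace(0, 1, m + 1))
    counts, edges = np.histogram(data, bins=edges)
    widths = np.diff(edges)
    density = counts / (len(data) * widths)   # density = mass / width
    return edges, density

rng = np.random.default_rng(3)
data = rng.exponential(size=2000)   # heavily skewed

edges, density = equal_frequency_histogram(data, m=10)
print(np.round(np.diff(edges), 3))  # narrow bins near 0, wide in the tail
```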
Bayesian Blocks:
A more sophisticated approach uses Bayesian model selection to choose bin boundaries optimally.
Idea: Treat the number and location of bins as unknown parameters. Use Bayesian inference to select the configuration that best balances fit and complexity.
The algorithm proceeds by dynamic programming:

- Process the data points in sorted order.
- For each prefix of the data, find the best partition ending at that point by trying every possible location of the last change point.
- Reuse the stored optima of shorter prefixes; this yields an $O(n^2)$ procedure that finds the globally optimal partition.
Formally: Maximize: $$\text{Fitness} = \sum_{j=1}^{m} \text{fitness}(B_j) - m \cdot \text{penalty}$$
where block fitness measures how well the constant rate within a bin explains the data, and the penalty prevents overfitting.
Advantages:

- Fully adaptive: the number and placement of bins are learned from the data.
- Principled complexity control through the penalty term.
Disadvantages:

- The $O(n^2)$ dynamic program is costly for large samples.
- Requires choosing a prior (penalty) on the number of blocks.
Variable-width histograms are most useful when you have data spanning many orders of magnitude (log-transformed data, heavy-tailed distributions) or when data density varies dramatically across the range. For most applications with moderate variability, fixed-width histograms with good bin width selection suffice.
Extending histograms to multiple dimensions reveals the first serious encounter with the curse of dimensionality—a phenomenon that plagues all nonparametric methods.
Construction in $d$ Dimensions:
For data in $\mathbb{R}^d$, we partition each dimension into bins, creating a grid of hypercubes. If each dimension has $m$ bins:

- the grid contains $m^d$ hypercube bins in total;
- the average number of points per bin is $n/m^d$;
- the density estimate in a bin of volume $h^d$ is $\hat{f}(x) = n_j/(n h^d)$.
The Curse Emerges:
Consider $n = 1000$ points and $m = 10$ bins per dimension:
| Dimension | Total Bins | Points per Bin |
|---|---|---|
| 1 | 10 | 100 |
| 2 | 100 | 10 |
| 3 | 1,000 | 1 |
| 4 | 10,000 | 0.1 |
| 5 | 100,000 | 0.01 |
| 10 | 10 billion | ~0 |
By dimension 4, we expect fewer than one point per bin on average. Most bins are empty, making density estimation essentially impossible.
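The emptiness is easy to observe empirically with `np.histogramdd`; a sketch using uniform toy data (we stop at $d = 4$ only to keep the bin array small):

```python
import numpy as np

rng = np.random.default_rng(4)
n, m = 1000, 10

occupied = []
for d in [1, 2, 3, 4]:
    # d-dimensional histogram of n uniform points, m bins per axis
    counts, _ = np.histogramdd(rng.random((n, d)), bins=m)
    frac = (counts > 0).mean()
    occupied.append(frac)
    print(f"d={d}: {m**d:>6} bins, {n / m**d:g} pts/bin on average, "
          f"{frac:.1%} of bins occupied")
```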
Theoretical Analysis:
For a $d$-dimensional histogram with bin widths $h$, the MISE becomes: $$\text{MISE} \approx \frac{h^2}{12} \sum_{i=1}^{d} \int \left(\frac{\partial f}{\partial x_i}\right)^2 dx + \frac{1}{nh^d}$$
Optimizing over $h$: $$h^* = O(n^{-1/(d+2)})$$ $$\text{MISE}^* = O(n^{-2/(d+2)})$$
As $d$ increases, the convergence rate deteriorates:
| Dimension | Rate | Points for 10% error |
|---|---|---|
| 1 | n⁻²/³ ≈ n⁻⁰·⁶⁷ | ~100 |
| 2 | n⁻¹/² ≈ n⁻⁰·⁵⁰ | ~1,000 |
| 3 | n⁻²/⁵ ≈ n⁻⁰·⁴⁰ | ~10,000 |
| 5 | n⁻²/⁷ ≈ n⁻⁰·²⁹ | ~1,000,000 |
| 10 | n⁻¹/⁶ ≈ n⁻⁰·¹⁷ | ~10¹⁰ |
The sample size required for a given accuracy grows exponentially with dimension.
Histograms (and indeed all nonparametric density estimators) fail catastrophically in high dimensions. For d > 3, use parametric methods, dimensionality reduction, or density ratio estimation instead. This is not a limitation of histograms specifically—it's a fundamental property of estimating functions in high-dimensional spaces.
Practical Multivariate Approaches:
2D Histograms: Still viable for joint distributions of two variables. Displayed as heatmaps.
Marginal Histograms: Estimate 1D marginal densities separately. Loses dependence structure.
Pair Plots: Grid of 2D histograms for all variable pairs. Good for exploration.
Hexagonal Binning: For 2D, hexagonal bins are visually superior and reduce boundary artifacts.
Dimensionality Reduction + Histogram: Project data to 2-3 dimensions (PCA, t-SNE, UMAP), then histogram.
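For the 2D case, `np.histogram2d` handles binning and density normalization directly. A quick check that the joint estimate integrates to 1 (the correlated toy data are an illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.standard_normal(5000)
y = 0.6 * x + 0.8 * rng.standard_normal(5000)   # correlated pair

# 2D density histogram: counts / (n * cell area)
H, xedges, yedges = np.histogram2d(x, y, bins=30, density=True)

# The estimate integrates to 1 over the plane
cell_area = np.outer(np.diff(xedges), np.diff(yedges))
print(np.sum(H * cell_area))   # ≈ 1.0
```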
When implementing histograms in practice, several considerations beyond basic construction affect the quality and utility of results.
Edge Handling:

- Use half-open bins $[b_{j-1}, b_j)$ and close the final bin so the sample maximum is counted exactly once.
- Decide explicitly what happens to values outside $[b_0, b_m]$: clip them, drop them, or give them dedicated overflow bins.
Computational Efficiency:
For finding which bin a point falls into:

- Equal-width bins: direct index arithmetic, $j = \lfloor (x - b_0)/h \rfloor$, an $O(1)$ computation per point.
- Unequal bins: binary search over the sorted edges, $O(\log m)$ per point.
Total complexity: O(n) for equal-width, O(n log m) for unequal.
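A sketch of the two lookup strategies (the function names are illustrative):

```python
import numpy as np

# Equal-width bins: direct index arithmetic, O(1) per point
def bin_index_equal(x, b0, h, m):
    return np.clip(((x - b0) // h).astype(int), 0, m - 1)

# Unequal bins: binary search over the sorted edges, O(log m) per point
def bin_index_unequal(x, edges):
    return np.clip(np.searchsorted(edges, x, side='right') - 1,
                   0, len(edges) - 2)

edges = np.array([0.0, 1.0, 2.5, 5.0, 10.0])
x = np.array([0.3, 2.0, 7.5])
print(bin_index_unequal(x, edges))        # [0 1 3]
print(bin_index_equal(x, 0.0, 2.5, 4))    # [0 0 3]
```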
Memory Efficiency:

- Only the $m$ bin counts need to be stored, not the raw data: $O(m)$ memory.
- Counts can be updated incrementally, which makes histograms well suited to streaming data.
Common Pitfalls:
Forgetting to Normalize: Dividing counts by $n$ gives probability, but dividing by $nh$ gives density. Know which you need.
Ignoring Empty Bins: Empty bins have zero density, not undefined. Handle in downstream calculations.
Extreme Values: A single outlier can drastically expand the range, wasting bins on empty regions. Consider trimming or log transformation.
Integer Data: Discrete data may cluster at integers. Align bin boundaries between integers to reveal true structure.
Sparse Regions: Bins with few points have high variance. Don't overinterpret small bumps.
Quality Checks:

- Verify that the density estimate integrates to 1.
- Recompute with several bin widths and origins; trust only the features that persist.
- Sanity-check against summary statistics (mean, quantiles) computed directly from the data.
```python
import numpy as np


def histogram_density(data, method='fd'):
    """
    Compute a histogram density estimate with various bin selection rules.

    Parameters
    ----------
    data : array-like
        1D array of observations.
    method : str
        Bin selection rule: 'sturges', 'scott', 'fd' (Freedman-Diaconis),
        or 'sqrt'.

    Returns
    -------
    bin_edges : array
        Edges of the bins.
    density : array
        Estimated density within each bin.
    bin_centers : array
        Midpoints of the bins.
    """
    data = np.asarray(data)
    n = len(data)

    # Calculate the number of bins based on the chosen rule
    if method == 'sturges':
        num_bins = int(np.ceil(1 + np.log2(n)))
    elif method == 'scott':
        h = 3.49 * np.std(data, ddof=1) * n**(-1/3)
        num_bins = int(np.ceil((data.max() - data.min()) / h))
    elif method == 'fd':  # Freedman-Diaconis
        iqr = np.percentile(data, 75) - np.percentile(data, 25)
        h = 2 * iqr * n**(-1/3)
        num_bins = int(np.ceil((data.max() - data.min()) / h))
    elif method == 'sqrt':
        num_bins = int(np.sqrt(n))
    else:
        raise ValueError(f"Unknown method: {method}")

    num_bins = max(1, num_bins)  # At least one bin

    # density=True divides each count by n * h, so the estimate integrates to 1
    counts, bin_edges = np.histogram(data, bins=num_bins, density=True)
    bin_centers = 0.5 * (bin_edges[:-1] + bin_edges[1:])

    return bin_edges, counts, bin_centers


# Example usage: a bimodal mixture of two normals
np.random.seed(42)
data = np.concatenate([
    np.random.normal(0, 1, 500),
    np.random.normal(4, 0.8, 300),
])

for method in ['sturges', 'scott', 'fd']:
    edges, density, centers = histogram_density(data, method)
    print(f"{method}: {len(density)} bins, max density = {density.max():.3f}")
```

We've now developed a thorough understanding of histogram density estimation—from basic construction through advanced topics. Let's consolidate the key insights:

- The histogram is a genuine density estimator: dividing counts by $nh$ makes it integrate to 1.
- Bin width $h$ controls a bias-variance tradeoff; the MISE-optimal width shrinks as $n^{-1/3}$ and yields MISE of order $n^{-2/3}$.
- Practical selection rules (Sturges, Scott, Freedman-Diaconis, cross-validation) trade assumptions for convenience; Freedman-Diaconis is a sound default.
- Bin origin sensitivity and discontinuity are intrinsic weaknesses, mitigated by ASH and addressed more fully by KDE.
- The curse of dimensionality makes histograms impractical beyond two or three dimensions.
What's Next:
The histogram's limitations—particularly discontinuity and bin origin sensitivity—motivate the search for smoother estimators. The next page introduces Kernel Density Estimation (KDE), which addresses these issues elegantly by centering a smooth kernel function at each data point. KDE achieves faster convergence rates, produces differentiable estimates, and eliminates dependence on arbitrary bin boundaries—making it the modern workhorse of nonparametric density estimation.
You now have a comprehensive understanding of histogram density estimation—from the intuitive to the mathematical. While histograms remain valuable for exploration and communication, their limitations will drive your appreciation for the more sophisticated methods ahead.