The histogram is perhaps the most recognizable statistical graphic in existence. From introductory statistics courses to high-dimensional data exploration, histograms serve as our primary window into the distributional properties of data. Yet beneath this familiar visualization lies a rigorous framework for nonparametric density estimation.
Introduced by Karl Pearson in 1895 (the term "histogram" itself was coined in an 1891 lecture), the histogram remains ubiquitous for good reason: it is simple to compute, easy to interpret, and often sufficient for exploratory analysis. But histograms are more than just bar charts—they are consistent density estimators with well-understood mathematical properties.
In this page, we'll move far beyond the superficial understanding of histograms. We'll explore their mathematical foundations, analyze their statistical properties, understand their limitations, and learn principled approaches to one of the most vexing questions in practical statistics: How do we choose the bin width?
By the end of this page, you will understand histograms as formal density estimators, derive their bias and variance properties, learn multiple bin width selection rules (including Sturges, Scott, and Freedman-Diaconis), and recognize histograms as a stepping stone to more sophisticated methods like kernel density estimation.
Let's begin with a precise definition. Given data $\mathcal{D} = \{x_1, x_2, \ldots, x_n\}$, a histogram partitions the data range into bins and estimates density based on the proportion of points falling into each bin.
Formal Definition:
Choose bin boundaries $b_0 < b_1 < \cdots < b_m$ such that $b_0 \leq \min_i x_i$ and $b_m \geq \max_i x_i$. Define bins $B_j = [b_{j-1}, b_j)$ for $j = 1, \ldots, m$ (with the last bin typically closed on both ends).
Let $n_j$ denote the count of observations in bin $B_j$: $$n_j = \sum_{i=1}^{n} \mathbf{1}(x_i \in B_j)$$
where $\mathbf{1}(\cdot)$ is the indicator function.
The histogram density estimator at point $x$ is: $$\hat{f}_h(x) = \frac{n_j}{n \cdot h_j} \quad \text{for } x \in B_j$$
where $h_j = b_j - b_{j-1}$ is the width of bin $B_j$.
For equal-width bins with common width $h = h_j$ for all $j$: $$\hat{f}_h(x) = \frac{n_j}{nh} \quad \text{for } x \in B_j$$
A critical but often overlooked point: a histogram used for density estimation must divide counts by the bin width, not just by $n$. This ensures the histogram integrates to 1 and is therefore a valid probability density. The frequency histogram ($n_j/n$) gives probability mass per bin, not density; density is mass per unit width.
Verification That Histogram Is a Valid Density:
$$\int_{-\infty}^{\infty} \hat{f}_h(x) \, dx = \sum_{j=1}^{m} \frac{n_j}{nh_j} \cdot h_j = \sum_{j=1}^{m} \frac{n_j}{n} = \frac{\sum_{j=1}^{m} n_j}{n} = \frac{n}{n} = 1$$
Also, $\hat{f}_h(x) \geq 0$ everywhere, so the histogram is indeed a valid probability density function.
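This normalization is easy to check numerically. A small sketch using `np.histogram` (the sample and the bin count of 20 are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=1000)

# Frequency histogram: counts / n gives probability mass per bin
counts, edges = np.histogram(data, bins=20)
widths = np.diff(edges)
mass = counts / len(data)

# Density histogram: counts / (n * h) gives density, which integrates to 1
density = counts / (len(data) * widths)

print(mass.sum())                # total probability mass: 1.0
print(np.sum(density * widths))  # integral of the density: 1.0
# NumPy's density=True applies exactly this normalization
print(np.allclose(density, np.histogram(data, bins=20, density=True)[0]))
```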
Two Critical Choices:
The bin width is by far the more important choice—it determines the fundamental properties of the estimate. The bin origin has minor effects but can shift features slightly. We'll focus primarily on bin width selection.
| Component | Notation | Role | Key Consideration |
|---|---|---|---|
| Bin boundaries | b₀, b₁, ..., bₘ | Partition the data range | Typically evenly spaced |
| Bin width | h = bⱼ - bⱼ₋₁ | Width of each interval | Main smoothing parameter |
| Bin count | nⱼ | Points in bin j | Random variable |
| Number of bins | m | Resolution of partition | m ≈ range/h |
| Density estimate | f̂(x) = nⱼ/(nh) | Estimated density at x | Piecewise constant |
Understanding the statistical properties of the histogram estimator requires analyzing its bias, variance, and mean squared error. These calculations reveal how bin width controls the fundamental tradeoff between smoothness and fidelity to data.
Expected Value of Bin Count:
For a point $x$ in bin $B_j$, the bin count $n_j$ follows a binomial distribution: $$n_j \sim \text{Binomial}(n, p_j)$$
where $p_j = \int_{B_j} f(t) \, dt = \mathbb{P}(X \in B_j)$ is the probability of landing in bin $B_j$.
Thus: $$\mathbb{E}[n_j] = n \cdot p_j \approx n \cdot h \cdot f(x)$$
where the approximation uses $p_j \approx h \cdot f(x)$ for small bin width.
Bias Analysis:
The expected value of the histogram estimator at a point $x$ in bin $B_j$ is: $$\mathbb{E}[\hat{f}_h(x)] = \mathbb{E}\left[\frac{n_j}{nh}\right] = \frac{p_j}{h}$$
Using Taylor expansion, if $f$ is twice differentiable, the probability of falling in bin $B_j = [b_{j-1}, b_j)$ is: $$p_j = \int_{b_{j-1}}^{b_j} f(t) \, dt = h \cdot f(c_j) + \frac{h^3}{24} f''(c_j) + O(h^5)$$
where $c_j = (b_{j-1} + b_j)/2$ is the bin center.
Therefore: $$\mathbb{E}[\hat{f}_h(x)] = f(c_j) + \frac{h^2}{24} f''(c_j) + O(h^4)$$
The bias at point $x$ (which may not equal the bin center) involves two components:

- a first-order term $\approx f'(c_j)(c_j - x)$ of order $O(h)$, coming from the offset of $x$ within its bin;
- a curvature term of order $O(h^2)$, coming from averaging $f$ over the bin.
The integrated squared bias is: $$\int \text{Bias}^2[\hat{f}_h(x)] \, dx = O(h^2)$$
Bias decreases as $h \to 0$ (smaller bins, less averaging).
Variance Analysis:
Since $n_j \sim \text{Binomial}(n, p_j)$, we have $\text{Var}(n_j) = np_j(1-p_j)$.
For small bin width, $p_j \approx hf(x) \ll 1$, so: $$\text{Var}(n_j) \approx np_j \approx nhf(x)$$
The variance of the histogram estimator is: $$\text{Var}[\hat{f}_h(x)] = \text{Var}\left[\frac{n_j}{nh}\right] = \frac{\text{Var}(n_j)}{n^2h^2} \approx \frac{nhf(x)}{n^2h^2} = \frac{f(x)}{nh}$$
The integrated variance is: $$\int \text{Var}[\hat{f}_h(x)] \, dx \approx \frac{1}{nh}$$
Variance decreases as $h$ increases (larger bins, more points per bin) or as $n$ increases.
Note the conflicting behavior: Bias decreases with smaller h (wanting h → 0), but Variance decreases with larger h (wanting h → ∞). This is the bias-variance tradeoff in action. The optimal h must balance these opposing forces.
Mean Integrated Squared Error (MISE):
Combining bias and variance: $$\text{MISE}(\hat{f}_h) = \int \mathbb{E}[(\hat{f}_h(x) - f(x))^2] \, dx = \int \text{Bias}^2 \, dx + \int \text{Var} \, dx$$
For the histogram: $$\text{MISE}(\hat{f}_h) \approx \frac{h^2}{12} \int [f'(x)]^2 \, dx + \frac{1}{nh}$$
(The bias term comes from discretizing $f$ into piecewise-constant bins; the $h^2/12$ coefficient is the standard asymptotic result when $f'$ is square-integrable.)
Optimal Bin Width:
Minimizing MISE with respect to $h$: $$\frac{d}{dh} \text{MISE} = \frac{h}{6} R(f') - \frac{1}{nh^2} = 0$$
where $R(f') = \int [f'(x)]^2 \, dx$ is the roughness of $f'$.
Solving: $$h^* = \left(\frac{6}{n \cdot R(f')}\right)^{1/3}$$
The optimal MISE scales as: $$\text{MISE}^* = O(n^{-2/3})$$
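The U-shape of the MISE, and the location of its minimum, can be checked by simulation. A sketch assuming a standard normal target; the grid range, replication count, and candidate widths are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(6)
grid = np.linspace(-4, 4, 801)
dg = grid[1] - grid[0]
true_f = np.exp(-grid**2 / 2) / np.sqrt(2 * np.pi)  # N(0,1) density

def mise_estimate(h, n=500, reps=200):
    """Monte Carlo estimate of the histogram MISE for N(0,1) samples."""
    ise = 0.0
    for _ in range(reps):
        data = rng.standard_normal(n)
        edges = np.arange(-4.0, 4.0 + h, h)
        counts, _ = np.histogram(data, bins=edges)
        density = counts / (n * h)
        # Evaluate the piecewise-constant estimate on the grid
        idx = np.clip(np.searchsorted(edges, grid, side='right') - 1,
                      0, len(counts) - 1)
        ise += np.sum((density[idx] - true_f) ** 2) * dg
    return ise / reps

# Too-fine bins (variance) and too-coarse bins (bias) both lose;
# 0.44 is roughly 3.49 * 500**(-1/3), Scott's choice for this n
for h in [0.05, 0.44, 2.0]:
    print(f"h = {h:4.2f}: estimated MISE = {mise_estimate(h):.4f}")
```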
Compare this to KDE's $O(n^{-4/5})$ rate—histograms converge slower due to their piecewise-constant nature.
| Property | Formula | Behavior | Implication |
|---|---|---|---|
| Integrated Bias² | O(h²) | Decreases with smaller h | Fine bins reduce approximation error |
| Integrated Variance | O(1/nh) | Decreases with larger h | Coarse bins reduce sampling error |
| Optimal h | O(n⁻¹/³) | Shrinks with more data | More data → finer resolution |
| Optimal MISE | O(n⁻²/³) | Slower than KDE (n⁻⁴/⁵) | Histograms less efficient |
| Convergence rate | n⁻²/³ | Sub-optimal for smooth densities | Piecewise constant limits accuracy |
The theoretical optimal bin width depends on $R(f')$, which involves the unknown density $f$—a classic chicken-and-egg problem. Practical bin width selection rules make assumptions or use data-driven approaches to estimate reasonable values.
1. Sturges' Rule (1926):
One of the oldest rules, based on the assumption that data follow a normal distribution: $$m = 1 + \log_2(n) = 1 + 3.322 \cdot \log_{10}(n)$$ $$h = \frac{\text{range}}{m} = \frac{\max(x) - \min(x)}{1 + \log_2(n)}$$
Derivation: Sturges assumed the data come from a normal distribution and that the bin counts should follow a binomial distribution approximating the normal. Under this idealized scenario, $1 + \log_2(n)$ bins suffice.
Limitations:

- Assumes approximately normal, unimodal data.
- The bin count grows only logarithmically in $n$, so large samples get far too few bins (oversmoothing).
- Has no MISE-based optimality justification; it predates the modern theory.
2. Scott's Rule (1979):
Derived by minimizing MISE assuming a normal reference distribution: $$h = 3.49 \cdot \hat{\sigma} \cdot n^{-1/3}$$
where $\hat{\sigma}$ is the sample standard deviation.
Derivation: For a normal density $N(\mu, \sigma^2)$, we have $R(f') = \frac{1}{4\sqrt{\pi}\sigma^3}$. Substituting into the optimal bin width formula gives: $$h^* = \left(\frac{24\sqrt{\pi}}{n}\right)^{1/3} \sigma \approx 3.49 \cdot \sigma \cdot n^{-1/3}$$
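The constant can be verified numerically: compute $R(f')$ for a standard normal by quadrature and plug it into $h^* = (6/(nR(f')))^{1/3}$. A sketch (the grid resolution and $n = 1000$ are arbitrary):

```python
import numpy as np

sigma, n = 1.0, 1000
x = np.linspace(-10, 10, 200_001)
dx = x[1] - x[0]

# Derivative of the N(0, sigma^2) density
f = np.exp(-x**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))
fprime = -x / sigma**2 * f

# Roughness R(f') agrees with the closed form 1 / (4 sqrt(pi) sigma^3)
R_numeric = np.sum(fprime**2) * dx
print(R_numeric, 1 / (4 * np.sqrt(np.pi) * sigma**3))

# Plugging into h* = (6 / (n R))^(1/3) recovers Scott's constant 3.49
h_opt = (6 / (n * R_numeric)) ** (1 / 3)
print(h_opt, 3.49 * sigma * n ** (-1 / 3))
```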
Advantages over Sturges:

- Grounded in MISE minimization rather than an ad hoc bin count.
- Scales with the spread of the data through $\hat{\sigma}$.
- The bin count grows as $n^{1/3}$, appropriate for large samples.
Limitations:

- $\hat{\sigma}$ is inflated by outliers, producing overly wide bins.
- Oversmooths skewed or multimodal densities relative to their optimal width.
3. Freedman-Diaconis Rule (1981):
A robust alternative using the interquartile range (IQR): $$h = 2 \cdot \text{IQR} \cdot n^{-1/3}$$
where $\text{IQR} = Q_3 - Q_1$ (75th percentile minus 25th percentile).
Rationale: For a normal distribution, $\text{IQR} \approx 1.349\sigma$, so this gives $h \approx 2.7\,\hat{\sigma}\, n^{-1/3}$, slightly narrower bins than Scott's rule. Unlike $\hat{\sigma}$, however, the IQR is robust to outliers: it is unaffected until more than 25% of the data are contaminated.
Advantages:

- Robust to outliers and heavy tails.
- Same $n^{-1/3}$ scaling as the MISE-optimal width.
- A sound default for real, messy data.
Limitations:

- Can undersmooth heavy-tailed data (a small IQR relative to the full range produces many bins).
- Fails when $\text{IQR} = 0$, as can happen with heavily discretized data.
| Rule | Formula for h | Assumptions | Best For | Weaknesses |
|---|---|---|---|---|
| Sturges (1926) | range / (1 + log₂n) | Normal data | Small, normal samples | Very few bins for large n |
| Scott (1979) | 3.49 σ̂ n⁻¹/³ | Normal reference | Unimodal symmetric | Sensitive to outliers |
| Freedman-Diaconis | 2 · IQR · n⁻¹/³ | Robustness needed | Data with outliers | Can undersmooth heavy tails |
| Doane (1976) | Sturges + skewness correction | Skewed normal | Moderately skewed | Limited for multimodality |
| sqrt(n) | range / √n | None (rule of thumb) | Quick exploration | No theoretical basis |
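NumPy implements several of these rules directly through `np.histogram_bin_edges`, which makes it easy to compare them on the same sample:

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.standard_normal(5000)

# NumPy accepts the rule name as the `bins` argument
for rule in ['sturges', 'scott', 'fd', 'sqrt']:
    edges = np.histogram_bin_edges(data, bins=rule)
    print(f"{rule:8s}: {len(edges) - 1:4d} bins, h = {edges[1] - edges[0]:.3f}")
```

Note how Sturges produces an order of magnitude fewer bins than Scott or Freedman-Diaconis at this sample size.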
4. Cross-Validation:
A data-driven approach that doesn't assume a reference distribution:
Leave-One-Out Cross-Validation: $$\text{CV}(h) = \frac{2}{(n-1)h} - \frac{n+1}{(n-1)n^2h} \sum_{j=1}^{m} n_j^2$$
Minimize CV$(h)$ over a grid of $h$ values. This minimizes an unbiased estimate of the integrated squared error, up to an additive constant that does not depend on $h$, without knowing $f$.
Advantages:

- Assumes no reference distribution.
- Directly targets the quantity we care about (the ISE).
Disadvantages:

- Requires evaluating the criterion over a grid of widths, costing more computation.
- The CV curve can be noisy, and its minimizer tends to undersmooth.
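The criterion above can be minimized by direct grid search. A minimal sketch; the helper name `histogram_cv`, its defaults, and the search grid are illustrative choices, not from the text:

```python
import numpy as np

def histogram_cv(data, h):
    """Leave-one-out CV criterion for an equal-width histogram of width h."""
    data = np.asarray(data)
    n = len(data)
    lo = data.min()
    m = int(np.ceil((data.max() - lo) / h)) + 1
    edges = lo + h * np.arange(m + 1)
    counts, _ = np.histogram(data, bins=edges)
    # CV(h) = 2/((n-1)h) - (n+1)/((n-1) n^2 h) * sum(n_j^2)
    return (2 / ((n - 1) * h)
            - (n + 1) / ((n - 1) * n**2 * h) * np.sum(counts**2))

rng = np.random.default_rng(2)
data = rng.standard_normal(500)

# Grid search over candidate bin widths
grid = np.linspace(0.05, 2.0, 100)
scores = [histogram_cv(data, h) for h in grid]
h_best = grid[int(np.argmin(scores))]
print(f"CV-optimal h ≈ {h_best:.3f}")
```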
For exploratory analysis, start with Freedman-Diaconis (robust default), then try multiple bin widths to see how features change. Sturges is only appropriate for small samples from nearly-normal distributions. For publication-quality figures, consider cross-validation or simply choose a width that best reveals the structure you want to communicate.
Despite their utility, histograms have fundamental limitations that motivated the development of more sophisticated density estimators. Understanding these limitations is essential for knowing when to use alternative methods.
Demonstration: Bin Origin Sensitivity
Consider data $\{0.9, 1.1, 2.9, 3.1\}$ representing two clusters near 1 and 3. With bin width $h = 1$ and origin 0, the bins $[0,1), [1,2), [2,3), [3,4]$ each contain exactly one point, so the histogram looks completely flat. Shift the origin to 0.5 and the bins $[0.5,1.5), [1.5,2.5), [2.5,3.5]$ contain 2, 0, and 2 points: two clear clusters separated by an empty bin.
Same data, same bin width, completely different visual impression. The "true" structure (two clusters) is revealed or hidden based on an arbitrary choice.
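This can be reproduced with `np.histogram` using explicit edges (bin width 1, with origins 0 and 0.5 as the illustrative choices):

```python
import numpy as np

data = np.array([0.9, 1.1, 2.9, 3.1])

# Origin 0: edges 0,1,2,3,4 -> every bin holds one point (looks flat)
counts_a, edges_a = np.histogram(data, bins=np.arange(0.0, 4.5, 1.0))
print(counts_a)   # [1 1 1 1]

# Origin 0.5: edges 0.5,1.5,2.5,3.5 -> two clusters with a gap between
counts_b, edges_b = np.histogram(data, bins=np.arange(0.5, 4.0, 1.0))
print(counts_b)   # [2 0 2]
```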
One mitigation is the Averaged Shifted Histogram (ASH), which averages multiple histograms with different origins. This produces a smoother estimate but adds complexity. Kernel density estimation solves these issues more elegantly by using points themselves as centers.
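A minimal ASH sketch, assuming equal-width bins and a fixed number of shifts; the helper name `ash` and its defaults are illustrative:

```python
import numpy as np

def ash(data, h, shifts=8, grid_size=400):
    """Averaged shifted histogram: average `shifts` histograms whose
    origins differ by h/shifts, evaluated on a common fine grid."""
    data = np.asarray(data)
    lo, hi = data.min() - h, data.max() + h
    grid = np.linspace(lo, hi, grid_size)
    est = np.zeros(grid_size)
    for s in range(shifts):
        origin = lo + s * h / shifts
        edges = np.arange(origin, hi + h, h)
        counts, _ = np.histogram(data, bins=edges)
        density = counts / (len(data) * h)
        # Assign each grid point the density of the bin it falls in
        idx = np.searchsorted(edges, grid, side='right') - 1
        inside = (idx >= 0) & (idx < len(counts))
        est[inside] += density[idx[inside]]
    return grid, est / shifts

# The two-cluster example above: the ASH is far less origin-dependent
data = np.array([0.9, 1.1, 2.9, 3.1])
grid, est = ash(data, h=1.0)
print(f"integral ≈ {np.sum(est) * (grid[1] - grid[0]):.3f}")
```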
Discontinuity and Derivative Failure:
For many applications, we need not just the density but also its derivatives. For example:

- Mode finding locates zeros of $f'(x)$.
- Gradient-based procedures such as mean-shift clustering follow $\nabla f$ uphill.
- Identifying inflection points and antimodes requires $f''$.
Histograms fail here entirely—their derivative is zero almost everywhere (within bins) and undefined at bin boundaries. This makes them unsuitable for applications requiring the density's shape characteristics.
Kernel density estimation, covered in the next page, addresses all these limitations by producing smooth, infinitely differentiable estimates (with appropriate kernel choice).
The limitations of fixed-width histograms have inspired extensions that allow bin widths to vary across the data range.
Variable-Width Histograms:
Rather than fixing bin width, we can fix the number of points per bin. The resulting histogram has:

- narrow bins where the data are dense (high resolution where the sample can support it);
- wide bins where the data are sparse (controlling variance in the tails).
Equal-Frequency Binning: $$n_j = \frac{n}{m} \quad \text{for all bins } j$$
The density estimate becomes: $$\hat{f}(x) = \frac{1}{nh_j} \cdot \frac{n}{m} = \frac{1}{mh_j}$$
Narrow $h_j$ → high density estimate; wide $h_j$ → low density estimate.
Advantages:

- Adapts resolution to the local data density.
- Avoids empty bins and noisy, sparsely populated tails.
Disadvantages:

- Bin edges become data-dependent random variables, complicating theoretical analysis.
- Harder to read: unequal bar widths are easily misinterpreted.
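A minimal sketch of equal-frequency binning via quantile edges; the helper name and the exponential toy data are illustrative:

```python
import numpy as np

def equal_frequency_histogram(data, m):
    """Variable-width histogram with roughly n/m points per bin."""
    data = np.asarray(data)
    # Bin edges at evenly spaced quantiles of the data
    edges = np.quantile(data, np.linspace(0, 1, m + 1))
    counts, edges = np.histogram(data, bins=edges)
    widths = np.diff(edges)
    density = counts / (len(data) * widths)   # density = mass / width
    return edges, density

rng = np.random.default_rng(3)
data = rng.exponential(size=2000)   # heavily skewed

edges, density = equal_frequency_histogram(data, m=10)
print(np.round(np.diff(edges), 3))  # narrow bins near 0, wide in the tail
```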
Bayesian Blocks:
A more sophisticated approach uses Bayesian model selection to choose bin boundaries optimally.
Idea: Treat the number and location of bins as unknown parameters. Use Bayesian inference to select the configuration that best balances fit and complexity.
The algorithm proceeds by dynamic programming:

- Process the data points in sorted order.
- For each prefix of the data, find the best partition ending at that point by trying every possible location of the last change point.
- Reuse the stored optima of shorter prefixes; this yields an $O(n^2)$ procedure that finds the globally optimal partition.
Formally: Maximize: $$\text{Fitness} = \sum_{j=1}^{m} \text{fitness}(B_j) - m \cdot \text{penalty}$$
where block fitness measures how well the constant rate within a bin explains the data, and the penalty prevents overfitting.
Advantages:

- Fully adaptive: the number and placement of bins are learned from the data.
- Principled complexity control through the penalty term.
Disadvantages:

- The $O(n^2)$ dynamic program is costly for large samples.
- Requires choosing a prior (penalty) on the number of blocks.
Variable-width histograms are most useful when you have data spanning many orders of magnitude (log-transformed data, heavy-tailed distributions) or when data density varies dramatically across the range. For most applications with moderate variability, fixed-width histograms with good bin width selection suffice.
Extending histograms to multiple dimensions reveals the first serious encounter with the curse of dimensionality—a phenomenon that plagues all nonparametric methods.
Construction in $d$ Dimensions:
For data in $\mathbb{R}^d$, we partition each dimension into bins, creating a grid of hypercubes. If each dimension has $m$ bins:

- the grid contains $m^d$ hypercube bins in total;
- the average number of points per bin is $n/m^d$;
- the density estimate in a bin of volume $h^d$ is $\hat{f}(x) = n_j/(n h^d)$.
The Curse Emerges:
Consider $n = 1000$ points and $m = 10$ bins per dimension:
| Dimension | Total Bins | Points per Bin |
|---|---|---|
| 1 | 10 | 100 |
| 2 | 100 | 10 |
| 3 | 1,000 | 1 |
| 4 | 10,000 | 0.1 |
| 5 | 100,000 | 0.01 |
| 10 | 10 billion | ~0 |
By dimension 4, we expect fewer than one point per bin on average. Most bins are empty, making density estimation essentially impossible.
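The emptiness is easy to observe empirically with `np.histogramdd`; a sketch using uniform toy data (we stop at $d = 4$ only to keep the bin array small):

```python
import numpy as np

rng = np.random.default_rng(4)
n, m = 1000, 10

occupied = []
for d in [1, 2, 3, 4]:
    # d-dimensional histogram of n uniform points, m bins per axis
    counts, _ = np.histogramdd(rng.random((n, d)), bins=m)
    frac = (counts > 0).mean()
    occupied.append(frac)
    print(f"d={d}: {m**d:>6} bins, {n / m**d:g} pts/bin on average, "
          f"{frac:.1%} of bins occupied")
```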
Theoretical Analysis:
For a $d$-dimensional histogram with bin widths $h$, the MISE becomes: $$\text{MISE} \approx \frac{h^2}{12} \sum_{i=1}^{d} \int \left(\frac{\partial f}{\partial x_i}\right)^2 dx + \frac{1}{nh^d}$$
Optimizing over $h$: $$h^* = O(n^{-1/(d+2)})$$ $$\text{MISE}^* = O(n^{-2/(d+2)})$$
As $d$ increases, the convergence rate deteriorates:
| Dimension | Rate | Points for 10% error |
|---|---|---|
| 1 | n⁻²/³ ≈ n⁻⁰·⁶⁷ | ~100 |
| 2 | n⁻¹/² ≈ n⁻⁰·⁵⁰ | ~1,000 |
| 3 | n⁻²/⁵ ≈ n⁻⁰·⁴⁰ | ~10,000 |
| 5 | n⁻²/⁷ ≈ n⁻⁰·²⁹ | ~1,000,000 |
| 10 | n⁻¹/⁶ ≈ n⁻⁰·¹⁷ | ~10¹⁰ |
The sample size required for a given accuracy grows exponentially with dimension.
Histograms (and indeed all nonparametric density estimators) fail catastrophically in high dimensions. For d > 3, use parametric methods, dimensionality reduction, or density ratio estimation instead. This is not a limitation of histograms specifically—it's a fundamental property of estimating functions in high-dimensional spaces.
Practical Multivariate Approaches:
2D Histograms: Still viable for joint distributions of two variables. Displayed as heatmaps.
Marginal Histograms: Estimate 1D marginal densities separately. Loses dependence structure.
Pair Plots: Grid of 2D histograms for all variable pairs. Good for exploration.
Hexagonal Binning: For 2D, hexagonal bins are visually superior and reduce boundary artifacts.
Dimensionality Reduction + Histogram: Project data to 2-3 dimensions (PCA, t-SNE, UMAP), then histogram.
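For the 2D case, `np.histogram2d` handles binning and density normalization directly. A quick check that the joint estimate integrates to 1 (the correlated toy data are an illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.standard_normal(5000)
y = 0.6 * x + 0.8 * rng.standard_normal(5000)   # correlated pair

# 2D density histogram: counts / (n * cell area)
H, xedges, yedges = np.histogram2d(x, y, bins=30, density=True)

# The estimate integrates to 1 over the plane
cell_area = np.outer(np.diff(xedges), np.diff(yedges))
print(np.sum(H * cell_area))   # ≈ 1.0
```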
When implementing histograms in practice, several considerations beyond basic construction affect the quality and utility of results.
Edge Handling:

- Use half-open bins $[b_{j-1}, b_j)$ and close the final bin so the sample maximum is counted exactly once.
- Decide explicitly what happens to values outside $[b_0, b_m]$: clip them, drop them, or give them dedicated overflow bins.
Computational Efficiency:
For finding which bin a point falls into:

- Equal-width bins: direct index arithmetic, $j = \lfloor (x - b_0)/h \rfloor$, an $O(1)$ computation per point.
- Unequal bins: binary search over the sorted edges, $O(\log m)$ per point.
Total complexity: O(n) for equal-width, O(n log m) for unequal.
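A sketch of the two lookup strategies (the function names are illustrative):

```python
import numpy as np

# Equal-width bins: direct index arithmetic, O(1) per point
def bin_index_equal(x, b0, h, m):
    return np.clip(((x - b0) // h).astype(int), 0, m - 1)

# Unequal bins: binary search over the sorted edges, O(log m) per point
def bin_index_unequal(x, edges):
    return np.clip(np.searchsorted(edges, x, side='right') - 1,
                   0, len(edges) - 2)

edges = np.array([0.0, 1.0, 2.5, 5.0, 10.0])
x = np.array([0.3, 2.0, 7.5])
print(bin_index_unequal(x, edges))        # [0 1 3]
print(bin_index_equal(x, 0.0, 2.5, 4))    # [0 0 3]
```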
Memory Efficiency:

- Only the $m$ bin counts need to be stored, not the raw data: $O(m)$ memory.
- Counts can be updated incrementally, which makes histograms well suited to streaming data.
Common Pitfalls:
Forgetting to Normalize: Dividing counts by $n$ gives probability, but dividing by $nh$ gives density. Know which you need.
Ignoring Empty Bins: Empty bins have zero density, not undefined. Handle in downstream calculations.
Extreme Values: A single outlier can drastically expand the range, wasting bins on empty regions. Consider trimming or log transformation.
Integer Data: Discrete data may cluster at integers. Align bin boundaries between integers to reveal true structure.
Sparse Regions: Bins with few points have high variance. Don't overinterpret small bumps.
Quality Checks:

- Verify that the density estimate integrates to 1.
- Recompute with several bin widths and origins; trust only the features that persist.
- Sanity-check against summary statistics (mean, quantiles) computed directly from the data.
```python
import numpy as np


def histogram_density(data, method='fd'):
    """
    Compute a histogram density estimate with various bin selection rules.

    Parameters
    ----------
    data : array-like
        1D array of observations.
    method : str
        Bin selection rule: 'sturges', 'scott', 'fd' (Freedman-Diaconis),
        or 'sqrt'.

    Returns
    -------
    bin_edges : array
        Edges of the bins.
    density : array
        Estimated density within each bin.
    bin_centers : array
        Midpoints of the bins.
    """
    data = np.asarray(data)
    n = len(data)

    # Calculate the number of bins based on the chosen rule
    if method == 'sturges':
        num_bins = int(np.ceil(1 + np.log2(n)))
    elif method == 'scott':
        h = 3.49 * np.std(data, ddof=1) * n**(-1/3)
        num_bins = int(np.ceil((data.max() - data.min()) / h))
    elif method == 'fd':  # Freedman-Diaconis
        iqr = np.percentile(data, 75) - np.percentile(data, 25)
        h = 2 * iqr * n**(-1/3)
        num_bins = int(np.ceil((data.max() - data.min()) / h))
    elif method == 'sqrt':
        num_bins = int(np.sqrt(n))
    else:
        raise ValueError(f"Unknown method: {method}")

    num_bins = max(1, num_bins)  # At least one bin

    # density=True divides each count by n * h, so the estimate integrates to 1
    counts, bin_edges = np.histogram(data, bins=num_bins, density=True)
    bin_centers = 0.5 * (bin_edges[:-1] + bin_edges[1:])

    return bin_edges, counts, bin_centers


# Example usage: a bimodal mixture of two normals
np.random.seed(42)
data = np.concatenate([
    np.random.normal(0, 1, 500),
    np.random.normal(4, 0.8, 300),
])

for method in ['sturges', 'scott', 'fd']:
    edges, density, centers = histogram_density(data, method)
    print(f"{method}: {len(density)} bins, max density = {density.max():.3f}")
```

We've now developed a thorough understanding of histogram density estimation—from basic construction through advanced topics. Let's consolidate the key insights:

- The histogram is a genuine density estimator: dividing counts by $nh$ makes it integrate to 1.
- Bin width $h$ controls a bias-variance tradeoff; the MISE-optimal width shrinks as $n^{-1/3}$ and yields MISE of order $n^{-2/3}$.
- Practical selection rules (Sturges, Scott, Freedman-Diaconis, cross-validation) trade assumptions for convenience; Freedman-Diaconis is a sound default.
- Bin origin sensitivity and discontinuity are intrinsic weaknesses, mitigated by ASH and addressed more fully by KDE.
- The curse of dimensionality makes histograms impractical beyond two or three dimensions.
What's Next:
The histogram's limitations—particularly discontinuity and bin origin sensitivity—motivate the search for smoother estimators. The next page introduces Kernel Density Estimation (KDE), which addresses these issues elegantly by centering a smooth kernel function at each data point. KDE achieves faster convergence rates, produces differentiable estimates, and eliminates dependence on arbitrary bin boundaries—making it the modern workhorse of nonparametric density estimation.
You now have a comprehensive understanding of histogram density estimation—from the intuitive to the mathematical. While histograms remain valuable for exploration and communication, their limitations will drive your appreciation for the more sophisticated methods ahead.