Min/max values and uniform distribution assumptions work adequately when data is evenly spread across ranges. But real-world data is rarely so cooperative. Consider a salary column whose values span $0 to $500,000 but cluster heavily toward the middle of that range.
A simple uniform-range calculation for salary > 100000 would estimate 80% selectivity, since (500K - 100K) / 500K = 0.8. The true selectivity is just 10%. Such errors lead to catastrophically wrong query plans.
Histograms solve this problem by dividing the value range into buckets that capture the actual data distribution, enabling precise selectivity estimates for any range predicate.
By the end of this page, you will understand different histogram types (equi-width, equi-depth, hybrid), how databases construct histograms from sample data, how histograms are used for range selectivity estimation, and when histograms provide significant value over simpler statistics.
In the context of database statistics, a histogram is a data structure that approximates the distribution of values in a column by dividing the value domain into buckets. Each bucket represents a range of values and stores summary statistics about that range.
Histogram Anatomy:
Column: salary (range $0 - $500,000)
Bucket | Range               | Row Count | Frequency
-------+---------------------+-----------+-----------
   1   | $0 - $40,000        |    50,000 |    5%
   2   | $40,000 - $60,000   |   600,000 |   60%
   3   | $60,000 - $100,000  |   300,000 |   30%
   4   | $100,000 - $500,000 |    50,000 |    5%
-------+---------------------+-----------+-----------
 Total |                     | 1,000,000 |  100%
With this histogram, the optimizer can precisely estimate:
- salary > 100000: ~5% (bucket 4 only)
- salary BETWEEN 40000 AND 60000: ~60% (bucket 2)
- salary < 60000: ~65% (buckets 1 + 2)

A histogram is an approximation, not the exact distribution. With k buckets, a histogram can distinguish at most k+1 distinct ranges. Real columns may have millions of distinct values. The art of histogram design is choosing bucket boundaries that maximize estimation accuracy with limited storage.
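The lookups above can be sketched in a few lines of Python. This is a toy model, not any database's implementation; buckets are (low, high, frequency) tuples taken from the salary table above, treated as half-open ranges:

```python
# Toy model of the salary histogram above: (low, high, frequency) per bucket.
salary_hist = [
    (0, 40_000, 0.05),
    (40_000, 60_000, 0.60),
    (60_000, 100_000, 0.30),
    (100_000, 500_000, 0.05),
]

def selectivity_at_least(buckets, threshold):
    """Fraction of rows with value >= threshold, assuming the threshold
    lands exactly on a bucket boundary (no interpolation needed)."""
    return sum(freq for low, high, freq in buckets if low >= threshold)

def selectivity_below(buckets, threshold):
    """Fraction of rows with value < threshold, same boundary assumption."""
    return sum(freq for low, high, freq in buckets if high <= threshold)

print(selectivity_at_least(salary_hist, 100_000))  # -> ~5% of rows
print(selectivity_below(salary_hist, 60_000))      # -> ~65% of rows
```

When the predicate boundary falls inside a bucket rather than on an edge, the optimizer interpolates within the bucket, as shown later in the range selectivity algorithm.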
The two fundamental histogram designs differ in how they partition the value domain into buckets.
Equi-Width (Equal-Range) Histograms
Divide the value domain into buckets of equal width, regardless of how many rows land in each. Construction is a single counting pass, but on skewed data a few buckets hold most of the rows while others sit nearly empty.
Equi-Depth (Equal-Height) Histograms
Divide values so each bucket contains approximately the same number of rows. Bucket widths vary based on data density.
Construction:
rows_per_bucket = total_rows / num_buckets
Sort all values, then partition so each bucket has ~rows_per_bucket rows
Example (Salary with 4 buckets, 1M rows):
rows_per_bucket = 1,000,000 / 4 = 250,000
Bucket | Range | Count
-------+--------------------------+--------
1 | $0 - $45,000 | 250,000
2 | $45,000 - $55,000 | 250,000
3 | $55,000 - $72,000 | 250,000
4 | $72,000 - $500,000 | 250,000
Advantages:
- Adapts to skew: dense regions get narrow buckets, sparse regions get wide ones
- Bounded per-bucket error: every bucket holds the same fraction of rows
- Accurate range estimates across the whole domain
Disadvantages:
- Requires sorting (or a sorted sample), so construction is O(n log n)
- Bucket boundaries shift as data changes, forcing periodic rebuilds
- Repeated values can straddle bucket boundaries (handled during construction, below)
| Aspect | Equi-Width | Equi-Depth | Hybrid/MCV |
|---|---|---|---|
| Construction Complexity | O(n) | O(n log n) sorting | O(n log n) |
| Storage per Bucket | Boundary + count | Boundary + count | Boundary + count + MCVs |
| Uniform Data Accuracy | Excellent | Good | Good |
| Skewed Data Accuracy | Poor | Good | Excellent |
| Range Query Support | Good | Excellent | Excellent |
| Point Query Support | Poor | Moderate | Excellent (MCVs) |
Understanding how databases build histograms clarifies their limitations and helps interpret their contents.
Equi-Depth Histogram Construction:
Algorithm: BuildEquiDepthHistogram(column, num_buckets)
1. SAMPLE: Collect sample of rows (e.g., 30,000 rows)
sample = RandomSample(column, sample_size)
2. SORT: Sort sampled values
sorted_values = Sort(sample)
3. PARTITION: Divide into equal-count buckets
rows_per_bucket = sample_size / num_buckets
FOR i = 0 TO num_buckets - 1:
start_idx = i * rows_per_bucket
end_idx = (i + 1) * rows_per_bucket - 1
bucket[i].low = sorted_values[start_idx]
bucket[i].high = sorted_values[end_idx]
bucket[i].frequency = rows_per_bucket / sample_size
4. EXTRAPOLATE: Scale frequencies to full table
FOR each bucket:
bucket.estimated_rows = bucket.frequency * table_row_count
5. STORE: Write histogram to system catalog
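The pseudocode above can be sketched in Python. This is an illustrative toy: `random.sample` stands in for step 1's sampling (real systems use reservoir or block sampling), and the last bucket absorbs any remainder rows:

```python
import random

def build_equi_depth_histogram(column, num_buckets, sample_size=30_000):
    """Toy version of BuildEquiDepthHistogram: returns (low, high, frequency)
    tuples. Frequencies are fractions of the sample; multiply by the table's
    row count to extrapolate (step 4)."""
    sample = random.sample(column, min(sample_size, len(column)))  # 1. SAMPLE
    sorted_values = sorted(sample)                                 # 2. SORT
    n = len(sorted_values)
    per_bucket = n // num_buckets
    buckets = []
    for i in range(num_buckets):                                   # 3. PARTITION
        start = i * per_bucket
        # last bucket absorbs remainder rows when n isn't divisible
        end = n - 1 if i == num_buckets - 1 else (i + 1) * per_bucket - 1
        count = end - start + 1
        buckets.append((sorted_values[start], sorted_values[end], count / n))
    return buckets

# Skewed salary-like data: a dense cluster plus a long sparse tail
random.seed(42)
data = [random.gauss(50_000, 8_000) for _ in range(10_000)] + \
       [random.uniform(100_000, 500_000) for _ in range(500)]
hist = build_equi_depth_histogram(data, num_buckets=4, sample_size=2_000)
# Dense region gets narrow buckets; the sparse tail shares one wide bucket
```

Note how the bucket widths come out of the data rather than being fixed up front: the tail bucket spans a far wider range than the buckets covering the dense cluster.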
Handling Edge Cases:
Repeated Values at Bucket Boundaries:
Sorted sample: [10, 10, 10, 10, 20, 20, 30, 40, 50, 60]
With 2 buckets, boundary falls on repeated 10s
Solution 1: Include all identical values in same bucket
Bucket 1: [10, 10, 10, 10, 20] — 5 values
Bucket 2: [20, 30, 40, 50, 60] — 5 values (20 split)
Solution 2: Popular value extraction
MCV: 10 (frequency 0.4)
Histogram: [20, 30], [40, 50, 60]
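Solution 2 (popular value extraction) might look like this in Python; the 0.3 threshold is an illustrative choice, not a standard value:

```python
from collections import Counter

def extract_mcvs(sample, mcv_threshold=0.3):
    """Pull values whose sample frequency meets the threshold out as
    most-common values (MCVs); the histogram is built over the rest."""
    n = len(sample)
    counts = Counter(sample)
    mcvs = {v: c / n for v, c in counts.items() if c / n >= mcv_threshold}
    rest = sorted(v for v in sample if v not in mcvs)
    return mcvs, rest

mcvs, rest = extract_mcvs([10, 10, 10, 10, 20, 20, 30, 40, 50, 60])
print(mcvs)  # {10: 0.4} -- 10 is stored as an MCV with its exact frequency
print(rest)  # [20, 20, 30, 40, 50, 60] -- histogram covers only these
```

Because the MCV's frequency is stored exactly, point queries on it no longer depend on within-bucket interpolation at all.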
Sparse Regions:
Sorted sample: [1, 2, 3, 1000000, 1000001]
With 2 buckets:
Bucket 1: [1, 2, 3] — dense region
Bucket 2: [1000000, 1000001] — sparse, long range
Query: value = 500000
Bucket 2 contains it, but actual frequency is 0!
This is why equi-depth outperforms equi-width for skewed data—the sparse region gets a single wide bucket instead of many empty ones.
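A quick sketch makes the contrast concrete: on the sparse sample above, an equi-width histogram with 10 buckets leaves 8 of them empty, while equi-depth would spend every bucket on actual rows. (Toy code, illustrative only.)

```python
def equi_width_counts(values, k):
    """Partition [min, max] into k equal-width buckets and count members."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    counts = [0] * k
    for v in values:
        # clamp the maximum value into the last bucket
        counts[min(int((v - lo) / width), k - 1)] += 1
    return counts

sample = [1, 2, 3, 1_000_000, 1_000_001]
print(equi_width_counts(sample, 10))
# [3, 0, 0, 0, 0, 0, 0, 0, 0, 2] -- eight buckets wasted on the empty gap
```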
-- PostgreSQL builds histograms during ANALYZE
-- Control histogram resolution with statistics_target

-- Default: 100 histogram buckets
SHOW default_statistics_target;

-- Increase precision for specific column
ALTER TABLE orders ALTER COLUMN amount SET STATISTICS 500;  -- Up to 500 buckets

-- Force histogram rebuild
ANALYZE orders (amount);

-- View resulting histogram
SELECT attname,
       array_length(histogram_bounds, 1) - 1 AS num_buckets,
       histogram_bounds[1] AS min_bound,
       histogram_bounds[array_length(histogram_bounds, 1)] AS max_bound
FROM pg_stats
WHERE tablename = 'orders' AND attname = 'amount';

-- Detailed bucket examination
SELECT bucket_num,
       histogram_bounds[bucket_num] AS bucket_start,
       histogram_bounds[bucket_num + 1] AS bucket_end,
       1.0 / (array_length(histogram_bounds, 1) - 1) AS fraction_per_bucket
FROM pg_stats,
     generate_series(1, array_length(histogram_bounds, 1) - 1) AS bucket_num
WHERE tablename = 'orders' AND attname = 'amount';

Histograms transform range predicate selectivity from crude estimation to precise calculation. Here's how the optimizer uses histograms:
Algorithm: Histogram Range Selectivity
Function: EstimateRangeSelectivity(histogram, low, high)
1. LOCATE BUCKETS: Find buckets overlapping [low, high]
start_bucket = FindBucket(low)
end_bucket = FindBucket(high)
2. FULL BUCKETS: Sum frequency of fully contained buckets
full_selectivity = SUM(bucket.frequency)
for buckets between start_bucket and end_bucket
3. PARTIAL BUCKETS: Interpolate for boundary buckets
For start_bucket (if partially overlapped):
overlap_fraction = (bucket.high - low) / bucket.width
start_contrib = bucket.frequency * overlap_fraction
For end_bucket (if partially overlapped):
overlap_fraction = (high - bucket.low) / bucket.width
end_contrib = bucket.frequency * overlap_fraction
4. COMBINE: Return total selectivity
RETURN start_contrib + full_selectivity + end_contrib
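The algorithm above in runnable form, as a sketch: buckets are (low, high, frequency) tuples, and partially overlapped buckets are handled by linear interpolation, which folds steps 2 and 3 into one loop:

```python
def estimate_range_selectivity(buckets, low, high):
    """Sum full-bucket frequencies and interpolate boundary buckets,
    assuming values are uniformly spread within each bucket."""
    selectivity = 0.0
    for b_low, b_high, freq in buckets:
        if b_high <= low or b_low >= high:
            continue                      # bucket entirely outside [low, high]
        # overlap/width is 1.0 for fully contained buckets, a fraction otherwise
        overlap = min(high, b_high) - max(low, b_low)
        selectivity += freq * overlap / (b_high - b_low)
    return selectivity

# A 4-bucket equi-depth histogram over a 'price' column (25% of rows each)
price_hist = [(0, 25, 0.25), (25, 50, 0.25), (50, 100, 0.25), (100, 500, 0.25)]
sel = estimate_range_selectivity(price_hist, 30, 120)
print(round(sel, 4))  # 0.4625 -> 46.25% of rows
```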
Worked Example:
Histogram for 'price' column (equi-depth, 4 buckets, 100K rows each):
Bucket | Range | Count | Frequency
-------+--------------+--------+-----------
1 | $0 - $25 | 100,000| 25%
2 | $25 - $50 | 100,000| 25%
3 | $50 - $100 | 100,000| 25%
4 | $100 - $500 | 100,000| 25%
Query: WHERE price BETWEEN 30 AND 120
1. Locate buckets:
- $30 falls in bucket 2 (range $25-$50)
- $120 falls in bucket 4 (range $100-$500)
2. Full buckets:
- Bucket 3 ($50-$100) is fully contained: 25%
3. Partial buckets:
- Bucket 2: overlap = ($50 - $30) / ($50 - $25) = 20/25 = 80%
contribution = 25% × 0.80 = 20%
- Bucket 4: overlap = ($120 - $100) / ($500 - $100) = 20/400 = 5%
contribution = 25% × 0.05 = 1.25%
4. Total selectivity: 20% + 25% + 1.25% = 46.25%
Estimated rows: 400,000 × 0.4625 = 185,000
Histogram interpolation assumes uniform distribution within each bucket. If bucket 4 contains mostly values near $100 (not evenly spread to $500), the 1.25% estimate is wrong. More buckets reduce this error at the cost of more storage. This is the fundamental accuracy/space tradeoff.
The number of histogram buckets directly impacts estimation accuracy. More buckets capture finer distribution details but consume more storage and increase maintenance overhead.
| Bucket Count | Storage | Accuracy | Best For |
|---|---|---|---|
| 10-20 | ~200 bytes | Coarse | Low-cardinality, uniform data |
| 50-100 | ~1 KB | Good | Most OLTP columns (default) |
| 200-256 | ~3 KB | Very good | Skewed high-cardinality columns |
| 500-1000 | ~10 KB | Excellent | Critical columns, complex distributions |
Estimation Error Analysis:
For an equi-depth histogram with k buckets:
Maximum estimation error for any range query:
error ≤ 2/k (one partial-bucket approximation at each end of the range, each bounded by 1/k)
With 100 buckets: max error ≤ 2/100 = 2%
With 200 buckets: max error ≤ 2/200 = 1%
When to Increase Bucket Count:
- EXPLAIN row estimates for range predicates are consistently far from actuals
- The column is skewed or high-cardinality, with an uneven value spread
- The column drives join order or index choice in critical queries

-- Check current statistics target
SELECT attname, attstattarget  -- -1 means use default
FROM pg_attribute a
JOIN pg_class c ON a.attrelid = c.oid
WHERE c.relname = 'orders';

-- Increase for specific column
ALTER TABLE orders ALTER COLUMN amount SET STATISTICS 500;

-- After changing, must re-analyze
ANALYZE orders (amount);

-- Verify bucket count increased
SELECT attname,
       array_length(histogram_bounds, 1) AS num_bounds,
       array_length(most_common_vals, 1) AS num_mcvs
FROM pg_stats
WHERE tablename = 'orders';

Histograms are powerful but not perfect. Understanding their limitations helps diagnose estimation problems.
- Expression predicates: WHERE YEAR(date_col) = 2024 can't use histograms on date_col. The optimizer sees the expression result, not the underlying column.
- Correlated predicates: WHERE state = 'CA' AND city = 'SF' selectivity is underestimated if most SF rows are in CA.
The Correlation Problem in Detail:
Table: employees (100,000 rows)
Column: department (10 distinct values)
- Engineering: 50,000 rows (50%)
- Sales: 30,000 rows (30%)
- Other 8 depts: 20,000 rows (20%)
Column: skill (100 distinct values)
- 'python': 40,000 rows (40%) — but 95% are in Engineering!
- 'sales_training': 25,000 rows (25%) — 100% are in Sales
Query: WHERE department = 'Engineering' AND skill = 'python'
Optimizer's estimate (independence assumption):
sel = sel(Engineering) × sel(python)
sel = 0.50 × 0.40 = 0.20
estimated_rows = 100,000 × 0.20 = 20,000
Actual rows:
38,000 (95% of python users are in Engineering!)
Error: 1.9x underestimate
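The arithmetic of the independence assumption, spelled out with the numbers from the example above:

```python
total_rows = 100_000
sel_dept = 0.50       # P(department = 'Engineering')
sel_skill = 0.40      # P(skill = 'python')

# Optimizer: multiply individual selectivities (independence assumption)
estimated = total_rows * sel_dept * sel_skill   # 20,000 rows

# Reality: 95% of the 40,000 python rows sit in Engineering
actual = total_rows * sel_skill * 0.95          # 38,000 rows

print(estimated, actual, actual / estimated)    # 1.9x underestimate
```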
This correlation blindness is why some databases support extended statistics (PostgreSQL) or multi-column statistics (SQL Server) that capture cross-column dependencies.
Compare EXPLAIN output's estimated rows with actual rows (EXPLAIN ANALYZE in PostgreSQL, the actual execution plan in SQL Server). If estimates are consistently off by 10x or more, the histogram may need more buckets, a fresh ANALYZE, or extended statistics for correlated columns.
Modern database systems employ sophisticated techniques to overcome basic histogram limitations.
1. Extended Statistics for Correlations (PostgreSQL):
-- Create statistics tracking correlation between columns
CREATE STATISTICS ext_dept_skill (dependencies)
ON department, skill
FROM employees;
ANALYZE employees;
-- Now optimizer uses functional dependency:
-- P(skill | department) instead of assuming independence
2. Filtered Statistics (SQL Server):
-- Create histogram only for subset of data
CREATE STATISTICS stat_recent_orders
ON orders (amount)
WHERE order_date >= '2024-01-01'
WITH FULLSCAN;
-- Optimizer uses this for queries with matching filter
3. Expression Statistics:
-- PostgreSQL: Statistics on expressions
CREATE STATISTICS expr_stats
ON (EXTRACT(YEAR FROM order_date))
FROM orders;
-- Now WHERE EXTRACT(YEAR FROM order_date) = 2024 can use histogram
4. Adaptive Statistics and Machine Learning:
Some advanced commercial databases use query feedback to improve estimates:
1. Execute query with estimated cardinalities
2. Observe actual cardinalities during execution
3. Update statistical model with observed data
4. Future queries benefit from learned corrections
-- SQL Server Adaptive Query Processing:
-- Memory grant feedback
-- Batch mode adaptive joins
-- Interleaved execution
-- Oracle Adaptive Plans:
-- Runtime plan switching based on actual cardinality
5. Incremental Histogram Updates:
Instead of rebuilding histograms from scratch:
-- Track insertions/deletions by bucket
-- Adjust bucket frequencies incrementally
-- Rebalance when buckets become too uneven
-- This reduces the cost of keeping histograms fresh
-- But accumulates error over time (periodic full refresh needed)
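One way to sketch incremental maintenance in Python; the class name, rebalance rule, and 1.5x threshold are illustrative choices, not taken from any specific database:

```python
import bisect

class IncrementalHistogram:
    """Adjust bucket counts on insert/delete instead of rebuilding;
    flag a rebalance when bucket depths drift too far apart."""
    def __init__(self, bounds, counts):
        self.bounds = bounds              # k+1 boundaries for k buckets
        self.counts = list(counts)

    def _bucket(self, value):
        # index of the bucket whose range covers value (clamped to the ends)
        i = bisect.bisect_right(self.bounds, value) - 1
        return min(max(i, 0), len(self.counts) - 1)

    def insert(self, value):
        self.counts[self._bucket(value)] += 1

    def delete(self, value):
        self.counts[self._bucket(value)] -= 1

    def needs_rebalance(self, skew_factor=1.5):
        # rebuild once the fullest bucket exceeds skew_factor x average depth
        avg = sum(self.counts) / len(self.counts)
        return max(self.counts) > skew_factor * avg

h = IncrementalHistogram(bounds=[0, 25, 50, 100, 500], counts=[100, 100, 100, 100])
for _ in range(150):
    h.insert(30)               # heavy insert traffic into one bucket
print(h.needs_rebalance())     # True: 250 rows vs. an average depth of 137.5
```

The cheap path (counter bumps) runs on every change; the expensive path (a full equi-depth rebuild) runs only when the skew check fires, which is the cost/accuracy tradeoff the text describes.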
Research is exploring machine learning models trained on query execution feedback to predict cardinalities. These 'learned' estimators can capture complex correlations that histograms miss, but face challenges with model staleness, training overhead, and interpretability.
Histograms transform range selectivity estimation from guesswork to principled calculation. They capture the shape of data distributions that simple min/max statistics cannot represent.
What's next:
Histograms provide the data; now we examine statistics maintenance—how and when to collect statistics, automatic maintenance strategies, and best practices for keeping statistics accurate as data evolves.
You now understand histogram types, construction algorithms, selectivity estimation mechanics, and common pitfalls. You can interpret histogram contents in system catalogs and configure bucket counts for improved accuracy. Next: keeping statistics fresh through maintenance strategies.