Min/max values and uniform distribution assumptions work adequately when data is evenly spread across ranges. But real-world data is rarely so cooperative. Consider a salary column whose values span $0 to $500,000 but cluster heavily toward the middle of that range.
A simple uniform-range calculation for salary > 100000 would estimate 80% selectivity, since (500K - 100K) / 500K = 0.8. The true selectivity is just 10%. Such errors lead to catastrophically wrong query plans.
Histograms solve this problem by dividing the value range into buckets that capture the actual data distribution, enabling precise selectivity estimates for any range predicate.
By the end of this page, you will understand different histogram types (equi-width, equi-depth, hybrid), how databases construct histograms from sample data, how histograms are used for range selectivity estimation, and when histograms provide significant value over simpler statistics.
In the context of database statistics, a histogram is a data structure that approximates the distribution of values in a column by dividing the value domain into buckets. Each bucket represents a range of values and stores summary statistics about that range.
Histogram Anatomy:
Column: salary (range $0 - $500,000)
Bucket | Range               | Row Count | Frequency
-------+---------------------+-----------+-----------
   1   | $0 - $40,000        |    50,000 |    5%
   2   | $40,000 - $60,000   |   600,000 |   60%
   3   | $60,000 - $100,000  |   300,000 |   30%
   4   | $100,000 - $500,000 |    50,000 |    5%
-------+---------------------+-----------+-----------
 Total |                     | 1,000,000 |  100%
With this histogram, the optimizer can precisely estimate:
- salary > 100000: ~5% (bucket 4 only)
- salary BETWEEN 40000 AND 60000: ~60% (bucket 2)
- salary < 60000: ~65% (buckets 1 + 2)

A histogram is an approximation, not the exact distribution. With k buckets, a histogram can distinguish at most k+1 distinct ranges. Real columns may have millions of distinct values. The art of histogram design is choosing bucket boundaries that maximize estimation accuracy with limited storage.
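The lookups above can be sketched in a few lines of Python. This is a toy model, not any database's implementation; buckets are (low, high, frequency) tuples taken from the salary table above, treated as half-open ranges:

```python
# Toy model of the salary histogram above: (low, high, frequency) per bucket.
salary_hist = [
    (0, 40_000, 0.05),
    (40_000, 60_000, 0.60),
    (60_000, 100_000, 0.30),
    (100_000, 500_000, 0.05),
]

def selectivity_at_least(buckets, threshold):
    """Fraction of rows with value >= threshold, assuming the threshold
    lands exactly on a bucket boundary (no interpolation needed)."""
    return sum(freq for low, high, freq in buckets if low >= threshold)

def selectivity_below(buckets, threshold):
    """Fraction of rows with value < threshold, same boundary assumption."""
    return sum(freq for low, high, freq in buckets if high <= threshold)

print(selectivity_at_least(salary_hist, 100_000))  # -> ~5% of rows
print(selectivity_below(salary_hist, 60_000))      # -> ~65% of rows
```

When the predicate boundary falls inside a bucket rather than on an edge, the optimizer interpolates within the bucket, as shown later in the range selectivity algorithm.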
The two fundamental histogram designs differ in how they partition the value domain into buckets.
Equi-Width (Equal-Range) Histograms
Divide the value domain into buckets of equal width, regardless of how many rows land in each. Construction is a single counting pass, but on skewed data a few buckets hold most of the rows while others sit nearly empty.
Equi-Depth (Equal-Height) Histograms
Divide values so each bucket contains approximately the same number of rows. Bucket widths vary based on data density.
Construction:
rows_per_bucket = total_rows / num_buckets
Sort all values, then partition so each bucket has ~rows_per_bucket rows
Example (Salary with 4 buckets, 1M rows):
rows_per_bucket = 1,000,000 / 4 = 250,000
Bucket | Range | Count
-------+--------------------------+--------
1 | $0 - $45,000 | 250,000
2 | $45,000 - $55,000 | 250,000
3 | $55,000 - $72,000 | 250,000
4 | $72,000 - $500,000 | 250,000
Advantages:
- Adapts to skew: dense regions get narrow buckets, sparse regions get wide ones
- Bounded per-bucket error: every bucket holds the same fraction of rows
- Accurate range estimates across the whole domain
Disadvantages:
- Requires sorting (or a sorted sample), so construction is O(n log n)
- Bucket boundaries shift as data changes, forcing periodic rebuilds
- Repeated values can straddle bucket boundaries (handled during construction, below)
| Aspect | Equi-Width | Equi-Depth | Hybrid/MCV |
|---|---|---|---|
| Construction Complexity | O(n) | O(n log n) sorting | O(n log n) |
| Storage per Bucket | Boundary + count | Boundary + count | Boundary + count + MCVs |
| Uniform Data Accuracy | Excellent | Good | Good |
| Skewed Data Accuracy | Poor | Good | Excellent |
| Range Query Support | Good | Excellent | Excellent |
| Point Query Support | Poor | Moderate | Excellent (MCVs) |
Understanding how databases build histograms clarifies their limitations and helps interpret their contents.
Equi-Depth Histogram Construction:
Algorithm: BuildEquiDepthHistogram(column, num_buckets)
1. SAMPLE: Collect sample of rows (e.g., 30,000 rows)
sample = RandomSample(column, sample_size)
2. SORT: Sort sampled values
sorted_values = Sort(sample)
3. PARTITION: Divide into equal-count buckets
rows_per_bucket = sample_size / num_buckets
FOR i = 0 TO num_buckets - 1:
start_idx = i * rows_per_bucket
end_idx = (i + 1) * rows_per_bucket - 1
bucket[i].low = sorted_values[start_idx]
bucket[i].high = sorted_values[end_idx]
bucket[i].frequency = rows_per_bucket / sample_size
4. EXTRAPOLATE: Scale frequencies to full table
FOR each bucket:
bucket.estimated_rows = bucket.frequency * table_row_count
5. STORE: Write histogram to system catalog
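The pseudocode above can be sketched in Python. This is an illustrative toy: `random.sample` stands in for step 1's sampling (real systems use reservoir or block sampling), and the last bucket absorbs any remainder rows:

```python
import random

def build_equi_depth_histogram(column, num_buckets, sample_size=30_000):
    """Toy version of BuildEquiDepthHistogram: returns (low, high, frequency)
    tuples. Frequencies are fractions of the sample; multiply by the table's
    row count to extrapolate (step 4)."""
    sample = random.sample(column, min(sample_size, len(column)))  # 1. SAMPLE
    sorted_values = sorted(sample)                                 # 2. SORT
    n = len(sorted_values)
    per_bucket = n // num_buckets
    buckets = []
    for i in range(num_buckets):                                   # 3. PARTITION
        start = i * per_bucket
        # last bucket absorbs remainder rows when n isn't divisible
        end = n - 1 if i == num_buckets - 1 else (i + 1) * per_bucket - 1
        count = end - start + 1
        buckets.append((sorted_values[start], sorted_values[end], count / n))
    return buckets

# Skewed salary-like data: a dense cluster plus a long sparse tail
random.seed(42)
data = [random.gauss(50_000, 8_000) for _ in range(10_000)] + \
       [random.uniform(100_000, 500_000) for _ in range(500)]
hist = build_equi_depth_histogram(data, num_buckets=4, sample_size=2_000)
# Dense region gets narrow buckets; the sparse tail shares one wide bucket
```

Note how the bucket widths come out of the data rather than being fixed up front: the tail bucket spans a far wider range than the buckets covering the dense cluster.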
Handling Edge Cases:
Repeated Values at Bucket Boundaries:
Sorted sample: [10, 10, 10, 10, 20, 20, 30, 40, 50, 60]
With 2 buckets, boundary falls on repeated 10s
Solution 1: Include all identical values in same bucket
Bucket 1: [10, 10, 10, 10, 20] — 5 values
Bucket 2: [20, 30, 40, 50, 60] — 5 values (20 split)
Solution 2: Popular value extraction
MCV: 10 (frequency 0.4)
Histogram: [20, 30], [40, 50, 60]
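Solution 2 (popular value extraction) might look like this in Python; the 0.3 threshold is an illustrative choice, not a standard value:

```python
from collections import Counter

def extract_mcvs(sample, mcv_threshold=0.3):
    """Pull values whose sample frequency meets the threshold out as
    most-common values (MCVs); the histogram is built over the rest."""
    n = len(sample)
    counts = Counter(sample)
    mcvs = {v: c / n for v, c in counts.items() if c / n >= mcv_threshold}
    rest = sorted(v for v in sample if v not in mcvs)
    return mcvs, rest

mcvs, rest = extract_mcvs([10, 10, 10, 10, 20, 20, 30, 40, 50, 60])
print(mcvs)  # {10: 0.4} -- 10 is stored as an MCV with its exact frequency
print(rest)  # [20, 20, 30, 40, 50, 60] -- histogram covers only these
```

Because the MCV's frequency is stored exactly, point queries on it no longer depend on within-bucket interpolation at all.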
Sparse Regions:
Sorted sample: [1, 2, 3, 1000000, 1000001]
With 2 buckets:
Bucket 1: [1, 2, 3] — dense region
Bucket 2: [1000000, 1000001] — sparse, long range
Query: value = 500000
Bucket 2 contains it, but actual frequency is 0!
This is why equi-depth outperforms equi-width for skewed data—the sparse region gets a single wide bucket instead of many empty ones.
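A quick sketch makes the contrast concrete: on the sparse sample above, an equi-width histogram with 10 buckets leaves 8 of them empty, while equi-depth would spend every bucket on actual rows. (Toy code, illustrative only.)

```python
def equi_width_counts(values, k):
    """Partition [min, max] into k equal-width buckets and count members."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    counts = [0] * k
    for v in values:
        # clamp the maximum value into the last bucket
        counts[min(int((v - lo) / width), k - 1)] += 1
    return counts

sample = [1, 2, 3, 1_000_000, 1_000_001]
print(equi_width_counts(sample, 10))
# [3, 0, 0, 0, 0, 0, 0, 0, 0, 2] -- eight buckets wasted on the empty gap
```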
-- PostgreSQL builds histograms during ANALYZE
-- Control histogram resolution with statistics_target

-- Default: 100 histogram buckets
SHOW default_statistics_target;

-- Increase precision for specific column
ALTER TABLE orders ALTER COLUMN amount SET STATISTICS 500;  -- Up to 500 buckets

-- Force histogram rebuild
ANALYZE orders (amount);

-- View resulting histogram
SELECT attname,
       array_length(histogram_bounds, 1) - 1 AS num_buckets,
       histogram_bounds[1] AS min_bound,
       histogram_bounds[array_length(histogram_bounds, 1)] AS max_bound
FROM pg_stats
WHERE tablename = 'orders' AND attname = 'amount';

-- Detailed bucket examination
SELECT bucket_num,
       histogram_bounds[bucket_num] AS bucket_start,
       histogram_bounds[bucket_num + 1] AS bucket_end,
       1.0 / (array_length(histogram_bounds, 1) - 1) AS fraction_per_bucket
FROM pg_stats,
     generate_series(1, array_length(histogram_bounds, 1) - 1) AS bucket_num
WHERE tablename = 'orders' AND attname = 'amount';

Histograms transform range predicate selectivity from crude estimation to precise calculation. Here's how the optimizer uses histograms:
Algorithm: Histogram Range Selectivity
Function: EstimateRangeSelectivity(histogram, low, high)
1. LOCATE BUCKETS: Find buckets overlapping [low, high]
start_bucket = FindBucket(low)
end_bucket = FindBucket(high)
2. FULL BUCKETS: Sum frequency of fully contained buckets
full_selectivity = SUM(bucket.frequency)
for buckets between start_bucket and end_bucket
3. PARTIAL BUCKETS: Interpolate for boundary buckets
For start_bucket (if partially overlapped):
overlap_fraction = (bucket.high - low) / bucket.width
start_contrib = bucket.frequency * overlap_fraction
For end_bucket (if partially overlapped):
overlap_fraction = (high - bucket.low) / bucket.width
end_contrib = bucket.frequency * overlap_fraction
4. COMBINE: Return total selectivity
RETURN start_contrib + full_selectivity + end_contrib
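The algorithm above in runnable form, as a sketch: buckets are (low, high, frequency) tuples, and partially overlapped buckets are handled by linear interpolation, which folds steps 2 and 3 into one loop:

```python
def estimate_range_selectivity(buckets, low, high):
    """Sum full-bucket frequencies and interpolate boundary buckets,
    assuming values are uniformly spread within each bucket."""
    selectivity = 0.0
    for b_low, b_high, freq in buckets:
        if b_high <= low or b_low >= high:
            continue                      # bucket entirely outside [low, high]
        # overlap/width is 1.0 for fully contained buckets, a fraction otherwise
        overlap = min(high, b_high) - max(low, b_low)
        selectivity += freq * overlap / (b_high - b_low)
    return selectivity

# A 4-bucket equi-depth histogram over a 'price' column (25% of rows each)
price_hist = [(0, 25, 0.25), (25, 50, 0.25), (50, 100, 0.25), (100, 500, 0.25)]
sel = estimate_range_selectivity(price_hist, 30, 120)
print(round(sel, 4))  # 0.4625 -> 46.25% of rows
```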
Worked Example:
Histogram for 'price' column (equi-depth, 4 buckets, 100K rows each):
Bucket | Range | Count | Frequency
-------+--------------+--------+-----------
1 | $0 - $25 | 100,000| 25%
2 | $25 - $50 | 100,000| 25%
3 | $50 - $100 | 100,000| 25%
4 | $100 - $500 | 100,000| 25%
Query: WHERE price BETWEEN 30 AND 120
1. Locate buckets:
- $30 falls in bucket 2 (range $25-$50)
- $120 falls in bucket 4 (range $100-$500)
2. Full buckets:
- Bucket 3 ($50-$100) is fully contained: 25%
3. Partial buckets:
- Bucket 2: overlap = ($50 - $30) / ($50 - $25) = 20/25 = 80%
contribution = 25% × 0.80 = 20%
- Bucket 4: overlap = ($120 - $100) / ($500 - $100) = 20/400 = 5%
contribution = 25% × 0.05 = 1.25%
4. Total selectivity: 20% + 25% + 1.25% = 46.25%
Estimated rows: 400,000 × 0.4625 = 185,000
Histogram interpolation assumes uniform distribution within each bucket. If bucket 4 contains mostly values near $100 (not evenly spread to $500), the 1.25% estimate is wrong. More buckets reduce this error at the cost of more storage. This is the fundamental accuracy/space tradeoff.
The number of histogram buckets directly impacts estimation accuracy. More buckets capture finer distribution details but consume more storage and increase maintenance overhead.
| Bucket Count | Storage | Accuracy | Best For |
|---|---|---|---|
| 10-20 | ~200 bytes | Coarse | Low-cardinality, uniform data |
| 50-100 | ~1 KB | Good | Most OLTP columns (default) |
| 200-256 | ~3 KB | Very good | Skewed high-cardinality columns |
| 500-1000 | ~10 KB | Excellent | Critical columns, complex distributions |
Estimation Error Analysis:
For an equi-depth histogram with k buckets:
Maximum estimation error for any range query:
error ≤ 2/k (one partial-bucket approximation at each end of the range, each bounded by 1/k)
With 100 buckets: max error ≤ 2/100 = 2%
With 200 buckets: max error ≤ 2/200 = 1%
When to Increase Bucket Count:
- EXPLAIN row estimates for range predicates are consistently far from actuals
- The column is skewed or high-cardinality, with an uneven value spread
- The column drives join order or index choice in critical queries

-- Check current statistics target
SELECT attname, attstattarget  -- -1 means use default
FROM pg_attribute a
JOIN pg_class c ON a.attrelid = c.oid
WHERE c.relname = 'orders';

-- Increase for specific column
ALTER TABLE orders ALTER COLUMN amount SET STATISTICS 500;

-- After changing, must re-analyze
ANALYZE orders (amount);

-- Verify bucket count increased
SELECT attname,
       array_length(histogram_bounds, 1) AS num_bounds,
       array_length(most_common_vals, 1) AS num_mcvs
FROM pg_stats
WHERE tablename = 'orders';

Histograms are powerful but not perfect. Understanding their limitations helps diagnose estimation problems.
- Expression predicates: WHERE YEAR(date_col) = 2024 can't use histograms on date_col. The optimizer sees the expression result, not the underlying column.
- Correlated predicates: WHERE state = 'CA' AND city = 'SF' selectivity is underestimated if most SF rows are in CA.
The Correlation Problem in Detail:
Table: employees (100,000 rows)
Column: department (10 distinct values)
- Engineering: 50,000 rows (50%)
- Sales: 30,000 rows (30%)
- Other 8 depts: 20,000 rows (20%)
Column: skill (100 distinct values)
- 'python': 40,000 rows (40%) — but 95% are in Engineering!
- 'sales_training': 25,000 rows (25%) — 100% are in Sales
Query: WHERE department = 'Engineering' AND skill = 'python'
Optimizer's estimate (independence assumption):
sel = sel(Engineering) × sel(python)
sel = 0.50 × 0.40 = 0.20
estimated_rows = 100,000 × 0.20 = 20,000
Actual rows:
38,000 (95% of python users are in Engineering!)
Error: 1.9x underestimate
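The arithmetic of the independence assumption, spelled out with the numbers from the example above:

```python
total_rows = 100_000
sel_dept = 0.50       # P(department = 'Engineering')
sel_skill = 0.40      # P(skill = 'python')

# Optimizer: multiply individual selectivities (independence assumption)
estimated = total_rows * sel_dept * sel_skill   # 20,000 rows

# Reality: 95% of the 40,000 python rows sit in Engineering
actual = total_rows * sel_skill * 0.95          # 38,000 rows

print(estimated, actual, actual / estimated)    # 1.9x underestimate
```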
This correlation blindness is why some databases support extended statistics (PostgreSQL) or multi-column statistics (SQL Server) that capture cross-column dependencies.
Compare EXPLAIN output's estimated rows with actual rows (EXPLAIN ANALYZE in PostgreSQL, the actual execution plan in SQL Server). If estimates are consistently off by 10x or more, the histogram may need more buckets, a fresh ANALYZE, or extended statistics for correlated columns.
Modern database systems employ sophisticated techniques to overcome basic histogram limitations.
1. Extended Statistics for Correlations (PostgreSQL):
-- Create statistics tracking correlation between columns
CREATE STATISTICS ext_dept_skill (dependencies)
ON department, skill
FROM employees;
ANALYZE employees;
-- Now optimizer uses functional dependency:
-- P(skill | department) instead of assuming independence
2. Filtered Statistics (SQL Server):
-- Create histogram only for subset of data
CREATE STATISTICS stat_recent_orders
ON orders (amount)
WHERE order_date >= '2024-01-01'
WITH FULLSCAN;
-- Optimizer uses this for queries with matching filter
3. Expression Statistics:
-- PostgreSQL: Statistics on expressions
CREATE STATISTICS expr_stats
ON (EXTRACT(YEAR FROM order_date))
FROM orders;
-- Now WHERE EXTRACT(YEAR FROM order_date) = 2024 can use histogram
4. Adaptive Statistics and Machine Learning:
Some advanced commercial databases use query feedback to improve estimates:
1. Execute query with estimated cardinalities
2. Observe actual cardinalities during execution
3. Update statistical model with observed data
4. Future queries benefit from learned corrections
-- SQL Server Adaptive Query Processing:
-- Memory grant feedback
-- Batch mode adaptive joins
-- Interleaved execution
-- Oracle Adaptive Plans:
-- Runtime plan switching based on actual cardinality
5. Incremental Histogram Updates:
Instead of rebuilding histograms from scratch:
-- Track insertions/deletions by bucket
-- Adjust bucket frequencies incrementally
-- Rebalance when buckets become too uneven
-- This reduces the cost of keeping histograms fresh
-- But accumulates error over time (periodic full refresh needed)
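One way to sketch incremental maintenance in Python; the class name, rebalance rule, and 1.5x threshold are illustrative choices, not taken from any specific database:

```python
import bisect

class IncrementalHistogram:
    """Adjust bucket counts on insert/delete instead of rebuilding;
    flag a rebalance when bucket depths drift too far apart."""
    def __init__(self, bounds, counts):
        self.bounds = bounds              # k+1 boundaries for k buckets
        self.counts = list(counts)

    def _bucket(self, value):
        # index of the bucket whose range covers value (clamped to the ends)
        i = bisect.bisect_right(self.bounds, value) - 1
        return min(max(i, 0), len(self.counts) - 1)

    def insert(self, value):
        self.counts[self._bucket(value)] += 1

    def delete(self, value):
        self.counts[self._bucket(value)] -= 1

    def needs_rebalance(self, skew_factor=1.5):
        # rebuild once the fullest bucket exceeds skew_factor x average depth
        avg = sum(self.counts) / len(self.counts)
        return max(self.counts) > skew_factor * avg

h = IncrementalHistogram(bounds=[0, 25, 50, 100, 500], counts=[100, 100, 100, 100])
for _ in range(150):
    h.insert(30)               # heavy insert traffic into one bucket
print(h.needs_rebalance())     # True: 250 rows vs. an average depth of 137.5
```

The cheap path (counter bumps) runs on every change; the expensive path (a full equi-depth rebuild) runs only when the skew check fires, which is the cost/accuracy tradeoff the text describes.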
Research is exploring machine learning models trained on query execution feedback to predict cardinalities. These 'learned' estimators can capture complex correlations that histograms miss, but face challenges with model staleness, training overhead, and interpretability.
Histograms transform range selectivity estimation from guesswork to principled calculation. They capture the shape of data distributions that simple min/max statistics cannot represent.
What's next:
Histograms provide the data; now we examine statistics maintenance—how and when to collect statistics, automatic maintenance strategies, and best practices for keeping statistics accurate as data evolves.
You now understand histogram types, construction algorithms, selectivity estimation mechanics, and common pitfalls. You can interpret histogram contents in system catalogs and configure bucket counts for improved accuracy. Next: keeping statistics fresh through maintenance strategies.