Table statistics tell us how big a table is—row counts, page counts, average row widths. But when the optimizer encounters a predicate like WHERE status = 'pending' or WHERE price > 1000, it needs to answer a fundamentally different question: How many rows will match this condition?
This is where column statistics become essential. Column statistics describe the distribution of values within individual columns—their range, their uniqueness, their null frequency, and their correlation with physical storage. Armed with this information, the optimizer can estimate selectivity: the fraction of rows that satisfy a given predicate.
By the end of this page, you will understand the key column-level statistics that databases maintain, how they're used to estimate predicate selectivity, the mathematical foundations of selectivity estimation, and common pitfalls that lead to poor estimates.
Database systems collect a rich set of statistics for each column. These statistics work together to enable accurate selectivity estimation for various types of predicates.
| Statistic | Definition | Selectivity Use |
|---|---|---|
| Distinct Value Count (NDV) | Number of unique values in the column | Equality predicates: sel(col = v) ≈ 1/NDV |
| Null Fraction | Proportion of NULL values | IS NULL: sel = null_frac; IS NOT NULL: sel = 1 - null_frac |
| Minimum Value | Smallest value in the column | Range predicates: col < min → sel = 0 |
| Maximum Value | Largest value in the column | Range predicates: col > max → sel = 0 |
| Average Width | Average byte size of values | Memory allocation, I/O estimation |
| Correlation | Physical vs. logical order correlation | Index scan cost estimation |
| Most Common Values (MCV) | List of frequent values with frequencies | Precise selectivity for skewed distributions |
| Histogram Bounds | Distribution buckets (see next page) | Range predicate selectivity |
Accurately estimating NDV (Number of Distinct Values) is computationally challenging. For a 100 million row table, exact computation requires remembering every value. Databases use probabilistic algorithms like HyperLogLog to estimate NDV with <3% error using only a few KB of memory.
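To make the flavor of such sketches concrete, here is a minimal HyperLogLog-style estimator in Python. It is an illustrative toy, not the implementation any particular database ships; the hash function, register count, and bias corrections are simplified assumptions.

```python
import hashlib
import math

def hll_estimate(values, b=12):
    """Minimal HyperLogLog sketch (illustrative only).

    b = number of index bits; 2**b registers, so memory is fixed
    (a few KB) no matter how many rows are scanned.
    """
    m = 1 << b
    registers = [0] * m
    for v in values:
        # 64-bit hash of the value
        h = int.from_bytes(hashlib.sha1(str(v).encode()).digest()[:8], "big")
        idx = h >> (64 - b)               # first b bits pick a register
        rest = h & ((1 << (64 - b)) - 1)  # remaining bits
        # rank = position of the leftmost 1-bit in the remaining bits
        rank = (64 - b) - rest.bit_length() + 1
        registers[idx] = max(registers[idx], rank)
    # harmonic mean of 2^register values with the standard bias correction
    alpha = 0.7213 / (1 + 1.079 / m)
    raw = alpha * m * m / sum(2.0 ** -r for r in registers)
    # small-range correction (linear counting) when many registers are empty
    zeros = registers.count(0)
    if raw <= 2.5 * m and zeros:
        return m * math.log(m / zeros)
    return raw

print(hll_estimate(f"user{i}" for i in range(100_000)))  # ≈ 100,000, within a few percent
```

The key property is that memory is bounded by the number of registers, not by the number of distinct values observed, which is what makes NDV estimation feasible on very large tables.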
The Number of Distinct Values (NDV), also called cardinality in some contexts, is perhaps the most influential column statistic. It directly affects selectivity estimates for equality predicates, join cardinality estimation, and GROUP BY result sizing.
How NDV is Used:
Consider the query:
SELECT * FROM orders WHERE status = 'pending'
To estimate how many rows match:
status column: NDV = 5 (e.g., pending, shipped, delivered, cancelled, returned)
selectivity ≈ 1 / NDV = 1/5 = 0.20
estimated rows = total_rows × selectivity = 10,000,000 × 0.20 = 2,000,000
Representation in System Catalogs:
Databases store NDV in different ways:
```sql
-- View NDV for all columns
SELECT
    attname AS column_name,
    n_distinct,
    CASE
        WHEN n_distinct > 0 THEN n_distinct::text
        WHEN n_distinct < 0 THEN
            (n_distinct * -1 *
             (SELECT reltuples FROM pg_class WHERE relname = 'orders'))::int::text
            || ' (estimated from ratio)'
        ELSE 'unknown'
    END AS interpretation
FROM pg_stats
WHERE tablename = 'orders';

-- Example output:
-- column_name  | n_distinct | interpretation
-- -------------+------------+-----------------------
-- order_id     | -1         | 10000000 (unique)
-- customer_id  | -0.1       | 1000000 (estimated)
-- status       | 5          | 5 (exact count)
-- order_date   | 365        | 365 (exact count)
```

The formula sel = 1/NDV assumes uniform distribution—that every distinct value appears equally often. This is rarely true! In most real data, some values are far more common than others ('pending' might be 50% of orders, while 'returned' is 1%). This is why Most Common Values (MCV) lists exist—to provide accurate selectivity for frequent values.
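Before moving on, a small Python helper (purely illustrative, not a PostgreSQL API) makes the n_distinct sign convention from the catalog query above explicit:

```python
def effective_ndv(n_distinct, row_count):
    """Interpret pg_stats.n_distinct: positive values are absolute distinct
    counts; negative values encode minus the ratio of distinct values to rows
    (so -1 means every value is unique)."""
    if n_distinct >= 0:
        return n_distinct
    return -n_distinct * row_count

print(effective_ndv(-1, 10_000_000))    # 10,000,000 (unique column)
print(effective_ndv(-0.1, 10_000_000))  # ≈ 1,000,000 (ratio-based estimate)
print(effective_ndv(5, 10_000_000))     # 5 (exact count)
```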
The null fraction represents the proportion of rows where the column value is NULL. This statistic is essential for estimating IS NULL and IS NOT NULL predicates, and for scaling every other predicate's selectivity, since NULL values never satisfy an ordinary comparison.
Selectivity Calculations with Nulls:
Given: null_fraction = 0.15 (15% of rows have NULL)
NDV = 10 (among non-null values)
total_rows = 1,000,000
Query: WHERE col IS NULL
sel = null_fraction = 0.15
estimated_rows = 1,000,000 × 0.15 = 150,000
Query: WHERE col IS NOT NULL
sel = 1 - null_fraction = 0.85
estimated_rows = 1,000,000 × 0.85 = 850,000
Query: WHERE col = 'value'
-- Only non-null rows can match
sel = (1 - null_fraction) × (1 / NDV)
sel = 0.85 × 0.1 = 0.085
estimated_rows = 1,000,000 × 0.085 = 85,000
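Expressed as code, the three cases above reduce to a few one-line functions. This is a sketch of the arithmetic only, with hypothetical helper names rather than any engine's actual API:

```python
def is_null_sel(null_frac):
    """Selectivity of col IS NULL."""
    return null_frac

def is_not_null_sel(null_frac):
    """Selectivity of col IS NOT NULL."""
    return 1 - null_frac

def equality_sel(null_frac, ndv):
    """Selectivity of col = constant: only non-null rows can match,
    and each distinct value is assumed equally frequent."""
    return (1 - null_frac) * (1.0 / ndv)

total_rows, null_frac, ndv = 1_000_000, 0.15, 10
print(total_rows * is_null_sel(null_frac))        # ≈ 150,000
print(total_rows * is_not_null_sel(null_frac))    # ≈ 850,000
print(total_rows * equality_sel(null_frac, ndv))  # ≈ 85,000
```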
When a column has very high null fraction (>90%), consider whether it should be split into a separate table (for the 10% that have values) or redesigned. Not only does this clean up the data model, it can dramatically improve query optimization and storage efficiency.
The minimum and maximum column values define the value range, enabling the optimizer to estimate selectivity for range predicates using a uniform distribution assumption within the range.
Range Selectivity Estimation:
Given a column price with:
min_value = 10
max_value = 1000
range_width = max - min = 990
For predicate price > 500:
sel = (max_value - constant) / range_width
sel = (1000 - 500) / 990
sel ≈ 0.505 (50.5%)
For predicate price BETWEEN 200 AND 300:
sel = (high - low) / range_width
sel = (300 - 200) / 990
sel ≈ 0.101 (10.1%)
Edge Cases Handled by Min/Max:
Query: WHERE price < 5
Since 5 < min(10), sel = 0 (no rows can match)
Query: WHERE price > 2000
Since 2000 > max(1000), sel = 0 (no rows can match)
Query: WHERE price BETWEEN 0 AND 2000
Since range covers entire column, sel = 1.0
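The range logic, including the out-of-range edge cases, fits in one small function. This is an illustrative sketch under the uniform-distribution assumption, not any engine's actual estimator:

```python
def range_selectivity(lo, hi, col_min, col_max):
    """Selectivity of col BETWEEN lo AND hi, assuming values are spread
    uniformly between the column's observed min and max."""
    width = col_max - col_min
    if width <= 0:
        return 1.0                      # degenerate column: a single value
    lo = max(lo, col_min)               # clamp the query range to the
    hi = min(hi, col_max)               #   observed value range
    if hi <= lo:
        return 0.0                      # range lies entirely outside min/max
    return (hi - lo) / width

# price column with min = 10, max = 1000
print(range_selectivity(500, 1000, 10, 1000))  # ≈ 0.505  (price > 500)
print(range_selectivity(200, 300, 10, 1000))   # ≈ 0.101  (BETWEEN 200 AND 300)
print(range_selectivity(-100, 5, 10, 1000))    # 0.0      (price < 5)
print(range_selectivity(0, 2000, 10, 1000))    # 1.0      (covers entire range)
```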
```sql
-- PostgreSQL does not store min/max as separate statistics;
-- the first and last histogram bounds approximate them
SELECT
    attname,
    histogram_bounds,
    -- pg_stats arrays are anyarray; cast via text to subscript them
    (histogram_bounds::text::text[])[1] AS min_value,
    (histogram_bounds::text::text[])[
        array_length(histogram_bounds::text::text[], 1)] AS max_value
FROM pg_stats
WHERE tablename = 'orders' AND attname = 'order_date';

-- The raw statistics behind pg_stats live in pg_statistic:
SELECT
    attname,
    stakind1,    -- type of statistic in slot 1 (1 = MCV, 2 = histogram, 3 = correlation)
    stavalues1,  -- the values (e.g., the MCVs themselves)
    stanumbers1  -- the associated numbers (e.g., MCV frequencies)
FROM pg_statistic ps
JOIN pg_attribute pa ON ps.starelid = pa.attrelid AND ps.staattnum = pa.attnum
JOIN pg_class pc ON pa.attrelid = pc.oid
WHERE pc.relname = 'orders';
```

Min/max values are sensitive to outliers. If one erroneous price value is -999999 (a data entry error), the range calculation becomes wildly inaccurate: because the denominator (max - min) explodes, a price > 500 query would be estimated at a tiny fraction of a percent instead of the correct ~50%. Data quality directly impacts optimizer accuracy.
Correlation measures how closely the physical ordering of rows on disk matches the logical ordering of column values. This statistic ranges from -1 to +1: a value near +1 means rows are stored in nearly ascending column order, a value near -1 means nearly descending order, and a value near 0 means a row's physical position tells you nothing about its column value.
Correlation profoundly affects the cost of index range scans.
Why Correlation Matters for Index Scans:
Consider a range query SELECT * FROM orders WHERE order_date BETWEEN '2024-01-01' AND '2024-01-31':
High Correlation (correlation ≈ 1):
Disk pages: [Jan orders] [Feb orders] [Mar orders] ...
Index scan: Retrieves consecutive pages → Sequential I/O
I/O pattern: Read pages 1-10 sequentially
Cost: 10 sequential page reads = very cheap
Low Correlation (correlation ≈ 0):
Disk pages: [Mixed: Jan, Mar, Jun] [Mixed: Feb, Aug, Jan] ...
Index scan: Retrieves scattered pages → Random I/O
I/O pattern: Read pages 1, 47, 103, 12, 89, ...
Cost: 10 random page reads = 10-100x more expensive
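A simplified cost model shows how correlation can blend these two extremes. The sketch below loosely mirrors PostgreSQL's approach of interpolating between best-case (sequential) and worst-case (random) I/O using the square of the correlation; the page-cost constants are PostgreSQL's defaults, and everything else is a simplification:

```python
def index_scan_io_cost(pages_fetched, correlation,
                       seq_page_cost=1.0, random_page_cost=4.0):
    """Blend best-case (perfectly correlated, sequential I/O) and worst-case
    (uncorrelated, random I/O) costs using correlation squared."""
    best = pages_fetched * seq_page_cost      # correlation close to +/-1
    worst = pages_fetched * random_page_cost  # correlation close to 0
    c2 = correlation * correlation
    return worst + c2 * (best - worst)

print(index_scan_io_cost(1000, 0.998))  # ≈ 1012, nearly sequential cost
print(index_scan_io_cost(1000, 0.02))   # ≈ 3999, essentially all random I/O
```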
| Correlation | Index Scan Cost | Optimizer Likely Choice | Reason |
|---|---|---|---|
| > 0.9 | Low (sequential I/O) | Index range scan | Clustered data minimizes page fetches |
| 0.5 - 0.9 | Moderate | Depends on selectivity | Partial clustering still beneficial |
| -0.5 - 0.5 | High (random I/O) | Seq scan for large ranges | Random I/O may exceed seq scan cost |
| < -0.9 | Low (reversed sequential) | Index scan (reverse) | Predictable access pattern |
```sql
-- PostgreSQL explicitly stores correlation
SELECT
    attname AS column_name,
    correlation,
    CASE
        WHEN abs(correlation) > 0.9 THEN 'Highly correlated - index scan efficient'
        WHEN abs(correlation) > 0.5 THEN 'Moderately correlated'
        ELSE 'Low correlation - random I/O likely'
    END AS index_scan_efficiency
FROM pg_stats
WHERE tablename = 'orders'
ORDER BY abs(correlation) DESC;

-- Example output:
-- column_name  | correlation | index_scan_efficiency
-- -------------+-------------+---------------------------------
-- order_id     | 0.998       | Highly correlated - efficient
-- order_date   | 0.876       | Moderately correlated
-- customer_id  | 0.023       | Low correlation - random I/O
-- status       | -0.004      | Low correlation - random I/O
```

Correlation degrades over time as INSERT/UPDATE operations scatter new data throughout the table. CLUSTER (PostgreSQL) or rebuilding clustered indexes (SQL Server) restores correlation. For time-series data, consider BRIN indexes that exploit natural correlation without requiring physical reorganization.
The Most Common Values (MCV) list addresses the fundamental limitation of the 1/NDV selectivity assumption: real data is rarely uniformly distributed. MCVs store the most frequent values along with their actual frequencies, enabling precise selectivity for these common values.
MCV Structure:
Column: status
MCV List:
Value | Frequency
-------------+-----------
'pending' | 0.45 (45% of non-null rows)
'shipped' | 0.30 (30%)
'delivered' | 0.15 (15%)
'cancelled' | 0.08 (8%)
'returned' | 0.02 (2%)
Total MCV coverage: 100% (all values are MCVs)
Selectivity with MCVs:
Query: WHERE status = 'pending'
Check MCV list → Found: frequency = 0.45
sel = 0.45 (exact, from MCV)
Query: WHERE status = 'processing'
Check MCV list → Not found
Remaining fraction = 1 - sum(MCV frequencies) = 1 - 1.0 = 0
sel = 0 (value doesn't exist)
Without MCVs (using 1/NDV):
sel = 1/5 = 0.20
Error for 'pending': 0.20 estimated vs 0.45 actual (the estimate is less than half the true selectivity)
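Putting the MCV lookup and the uniform fallback together gives a compact estimator. This Python sketch mirrors the logic above (and the col = constant row of the formula table later on); it ignores NULLs for brevity:

```python
def eq_selectivity(value, mcv, ndv):
    """Selectivity of col = value, given an MCV dict (value -> frequency)
    and the column's number of distinct values."""
    if value in mcv:
        return mcv[value]                      # exact frequency from stats
    remaining_freq = 1.0 - sum(mcv.values())   # mass not covered by MCVs
    remaining_ndv = ndv - len(mcv)             # distinct values outside MCVs
    if remaining_ndv <= 0:
        return 0.0                             # every distinct value is an MCV
    return remaining_freq / remaining_ndv      # spread uniformly over the rest

status_mcv = {'pending': 0.45, 'shipped': 0.30, 'delivered': 0.15,
              'cancelled': 0.08, 'returned': 0.02}
print(eq_selectivity('pending', status_mcv, ndv=5))     # 0.45
print(eq_selectivity('processing', status_mcv, ndv=5))  # 0.0
```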
```sql
-- MCVs are in most_common_vals and most_common_freqs
SELECT
    attname AS column_name,
    most_common_vals,
    most_common_freqs,
    n_distinct
FROM pg_stats
WHERE tablename = 'orders' AND attname = 'status';

-- Example output (reformatted):
-- column_name:        status
-- most_common_vals:   {pending,shipped,delivered,cancelled,returned}
-- most_common_freqs:  {0.45,0.30,0.15,0.08,0.02}
-- n_distinct:         5

-- Unpacked view using unnest (pg_stats arrays are anyarray, so cast via text):
SELECT
    v.value,
    f.frequency,
    ROUND((f.frequency * 100)::numeric, 1) || '%' AS percentage
FROM pg_stats,
LATERAL unnest(most_common_vals::text::text[]) WITH ORDINALITY AS v(value, ord),
LATERAL unnest(most_common_freqs) WITH ORDINALITY AS f(frequency, ord)
WHERE tablename = 'orders'
  AND attname = 'status'
  AND v.ord = f.ord
ORDER BY f.frequency DESC;
```

Databases limit the number of MCVs stored (typically 100-200). For high-cardinality columns like email addresses, MCVs provide no benefit. MCVs shine for categorical columns (status, country, category) with few distinct values but skewed distributions.
Let's consolidate the selectivity estimation formulas that use column statistics. These formulas are the mathematical foundation of predicate cost estimation.
| Predicate | Formula | Statistics Used |
|---|---|---|
| col = constant | MCV[val] if exists, else (1 - MCV_sum) / (NDV - MCV_count) | MCV, NDV |
| col <> constant | 1 - sel(col = constant) | MCV, NDV |
| col IS NULL | null_fraction | Null fraction |
| col IS NOT NULL | 1 - null_fraction | Null fraction |
| col < constant | (constant - min) / (max - min) | Min, Max (or histogram) |
| col > constant | (max - constant) / (max - min) | Min, Max (or histogram) |
| col BETWEEN a AND b | (b - a) / (max - min) | Min, Max (or histogram) |
| col LIKE 'prefix%' | 1 / NDV (approximation) or histogram bucket | NDV, histogram |
| col1 = col2 (join) | 1 / MAX(NDV(col1), NDV(col2)) | NDV of both columns |
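The join row of the table deserves a brief illustration, since it is the basis for join cardinality estimates. The sketch below is the classic textbook estimate used when only the two NDVs are known; the customers table figures are hypothetical, while the orders figures follow the earlier example:

```python
def join_selectivity(ndv_left, ndv_right):
    """Equi-join selectivity: assume the side with fewer distinct values is
    contained in the other, so a row pair matches with probability
    1 / max(NDV_left, NDV_right)."""
    return 1.0 / max(ndv_left, ndv_right)

# orders.customer_id (1,000,000 distinct) joined to a hypothetical
# customers.id column with 1,200,000 distinct values
rows_orders, rows_customers = 10_000_000, 1_200_000
sel = join_selectivity(1_000_000, 1_200_000)
print(rows_orders * rows_customers * sel)  # ≈ 10,000,000 matching pairs
```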
Compound Predicate Selectivity:
For compound predicates, optimizers assume independence between conditions:
sel(A AND B) = sel(A) × sel(B)
sel(A OR B) = sel(A) + sel(B) - sel(A) × sel(B)
sel(NOT A) = 1 - sel(A)
Example:
WHERE status = 'pending' AND order_date > '2024-01-01'
sel(status = 'pending') = 0.45 (from MCV)
sel(order_date > '2024-01-01') = 0.25 (from range calculation)
sel(combined) = 0.45 × 0.25 = 0.1125 (11.25%)
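As code, the independence-based combination rules are one-liners, which is part of why they are used so widely despite their known weakness. A minimal sketch with hypothetical function names:

```python
def and_sel(a, b):
    """sel(A AND B), assuming A and B are independent."""
    return a * b

def or_sel(a, b):
    """sel(A OR B) by inclusion-exclusion, still assuming independence."""
    return a + b - a * b

sel_status = 0.45  # status = 'pending', from the MCV list
sel_date = 0.25    # order_date > '2024-01-01', from the range estimate
print(and_sel(sel_status, sel_date))  # ≈ 0.1125, about 11.25% of rows
print(or_sel(sel_status, sel_date))   # ≈ 0.5875
```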
Independence is often false! If 'pending' orders are mostly from the last month (correlated with order_date), the true selectivity might be 30%, not 11%. Some systems support multi-column statistics to capture correlations, but most rely on the independence assumption and accept estimation error.
Column statistics transform raw metadata into actionable selectivity estimates. They're the mechanism by which the optimizer predicts filter results without executing queries.
What's next:
Column statistics work well for low-cardinality columns and exact value matches, but what about high-cardinality numeric columns with complex distributions? The next page explores histograms—the sophisticated statistical structures that capture full value distributions for precise range selectivity estimation.
You now understand how column-level statistics enable selectivity estimation. You can read statistics from system catalogs, understand their role in cost estimation, and recognize the limitations of uniform distribution assumptions. Next: histograms for capturing complex distributions.