Table statistics tell us how big a table is—row counts, page counts, average row widths. But when the optimizer encounters a predicate like WHERE status = 'pending' or WHERE price > 1000, it needs to answer a fundamentally different question: How many rows will match this condition?
This is where column statistics become essential. Column statistics describe the distribution of values within individual columns—their range, their uniqueness, their null frequency, and their correlation with physical storage. Armed with this information, the optimizer can estimate selectivity: the fraction of rows that satisfy a given predicate.
By the end of this page, you will understand the key column-level statistics that databases maintain, how they're used to estimate predicate selectivity, the mathematical foundations of selectivity estimation, and common pitfalls that lead to poor estimates.
Database systems collect a rich set of statistics for each column. These statistics work together to enable accurate selectivity estimation for various types of predicates.
| Statistic | Definition | Selectivity Use |
|---|---|---|
| Distinct Value Count (NDV) | Number of unique values in the column | Equality predicates: sel(col = v) ≈ 1/NDV |
| Null Fraction | Proportion of NULL values | IS NULL: sel = null_frac; IS NOT NULL: sel = 1 - null_frac |
| Minimum Value | Smallest value in the column | Range predicates: col < min → sel = 0 |
| Maximum Value | Largest value in the column | Range predicates: col > max → sel = 0 |
| Average Width | Average byte size of values | Memory allocation, I/O estimation |
| Correlation | Physical vs. logical order correlation | Index scan cost estimation |
| Most Common Values (MCV) | List of frequent values with frequencies | Precise selectivity for skewed distributions |
| Histogram Bounds | Distribution buckets (see next page) | Range predicate selectivity |
Accurately estimating NDV (Number of Distinct Values) is computationally challenging. For a 100 million row table, exact computation requires remembering every value. Databases use probabilistic algorithms like HyperLogLog to estimate NDV with <3% error using only a few KB of memory.
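To make the flavor of such sketches concrete, here is a minimal HyperLogLog-style estimator in Python. It is an illustrative toy, not the implementation any particular database ships; the hash function, register count, and bias corrections are simplified assumptions.

```python
import hashlib
import math

def hll_estimate(values, b=12):
    """Minimal HyperLogLog sketch (illustrative only).

    b = number of index bits; 2**b registers, so memory is fixed
    (a few KB) no matter how many rows are scanned.
    """
    m = 1 << b
    registers = [0] * m
    for v in values:
        # 64-bit hash of the value
        h = int.from_bytes(hashlib.sha1(str(v).encode()).digest()[:8], "big")
        idx = h >> (64 - b)               # first b bits pick a register
        rest = h & ((1 << (64 - b)) - 1)  # remaining bits
        # rank = position of the leftmost 1-bit in the remaining bits
        rank = (64 - b) - rest.bit_length() + 1
        registers[idx] = max(registers[idx], rank)
    # harmonic mean of 2^register values with the standard bias correction
    alpha = 0.7213 / (1 + 1.079 / m)
    raw = alpha * m * m / sum(2.0 ** -r for r in registers)
    # small-range correction (linear counting) when many registers are empty
    zeros = registers.count(0)
    if raw <= 2.5 * m and zeros:
        return m * math.log(m / zeros)
    return raw

print(hll_estimate(f"user{i}" for i in range(100_000)))  # ≈ 100,000, within a few percent
```

The key property is that memory is bounded by the number of registers, not by the number of distinct values observed, which is what makes NDV estimation feasible on very large tables.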
The Number of Distinct Values (NDV), also called cardinality in some contexts, is perhaps the most influential column statistic. It directly affects selectivity estimates for equality predicates, join cardinality estimation, and GROUP BY result sizing.
How NDV is Used:
Consider the query:
SELECT * FROM orders WHERE status = 'pending'
To estimate how many rows match:
status column: NDV = 5 (e.g., pending, shipped, delivered, cancelled, returned)
selectivity ≈ 1 / NDV = 1/5 = 0.20
estimated rows = total_rows × selectivity = 10,000,000 × 0.20 = 2,000,000
Representation in System Catalogs:
Databases store NDV in different ways:
```sql
-- View NDV for all columns
SELECT
    attname AS column_name,
    n_distinct,
    CASE
        WHEN n_distinct > 0 THEN n_distinct::text
        WHEN n_distinct < 0 THEN
            (n_distinct * -1 *
             (SELECT reltuples FROM pg_class WHERE relname = 'orders'))::int::text
            || ' (estimated from ratio)'
        ELSE 'unknown'
    END AS interpretation
FROM pg_stats
WHERE tablename = 'orders';

-- Example output:
-- column_name  | n_distinct | interpretation
-- -------------+------------+-----------------------
-- order_id     | -1         | 10000000 (unique)
-- customer_id  | -0.1       | 1000000 (estimated)
-- status       | 5          | 5 (exact count)
-- order_date   | 365        | 365 (exact count)
```

The formula sel = 1/NDV assumes uniform distribution—that every distinct value appears equally often. This is rarely true! In most real data, some values are far more common than others ('pending' might be 50% of orders, while 'returned' is 1%). This is why Most Common Values (MCV) lists exist—to provide accurate selectivity for frequent values.
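Before moving on, a small Python helper (purely illustrative, not a PostgreSQL API) makes the n_distinct sign convention from the catalog query above explicit:

```python
def effective_ndv(n_distinct, row_count):
    """Interpret pg_stats.n_distinct: positive values are absolute distinct
    counts; negative values encode minus the ratio of distinct values to rows
    (so -1 means every value is unique)."""
    if n_distinct >= 0:
        return n_distinct
    return -n_distinct * row_count

print(effective_ndv(-1, 10_000_000))    # 10,000,000 (unique column)
print(effective_ndv(-0.1, 10_000_000))  # ≈ 1,000,000 (ratio-based estimate)
print(effective_ndv(5, 10_000_000))     # 5 (exact count)
```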
The null fraction represents the proportion of rows where the column value is NULL. This statistic is essential for estimating IS NULL and IS NOT NULL predicates, and for scaling every other predicate's selectivity, since NULL values never satisfy an ordinary comparison.
Selectivity Calculations with Nulls:
Given: null_fraction = 0.15 (15% of rows have NULL)
NDV = 10 (among non-null values)
total_rows = 1,000,000
Query: WHERE col IS NULL
sel = null_fraction = 0.15
estimated_rows = 1,000,000 × 0.15 = 150,000
Query: WHERE col IS NOT NULL
sel = 1 - null_fraction = 0.85
estimated_rows = 1,000,000 × 0.85 = 850,000
Query: WHERE col = 'value'
-- Only non-null rows can match
sel = (1 - null_fraction) × (1 / NDV)
sel = 0.85 × 0.1 = 0.085
estimated_rows = 1,000,000 × 0.085 = 85,000
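Expressed as code, the three cases above reduce to a few one-line functions. This is a sketch of the arithmetic only, with hypothetical helper names rather than any engine's actual API:

```python
def is_null_sel(null_frac):
    """Selectivity of col IS NULL."""
    return null_frac

def is_not_null_sel(null_frac):
    """Selectivity of col IS NOT NULL."""
    return 1 - null_frac

def equality_sel(null_frac, ndv):
    """Selectivity of col = constant: only non-null rows can match,
    and each distinct value is assumed equally frequent."""
    return (1 - null_frac) * (1.0 / ndv)

total_rows, null_frac, ndv = 1_000_000, 0.15, 10
print(total_rows * is_null_sel(null_frac))        # ≈ 150,000
print(total_rows * is_not_null_sel(null_frac))    # ≈ 850,000
print(total_rows * equality_sel(null_frac, ndv))  # ≈ 85,000
```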
When a column has very high null fraction (>90%), consider whether it should be split into a separate table (for the 10% that have values) or redesigned. Not only does this clean up the data model, it can dramatically improve query optimization and storage efficiency.
The minimum and maximum column values define the value range, enabling the optimizer to estimate selectivity for range predicates using a uniform distribution assumption within the range.
Range Selectivity Estimation:
Given a column price with:
min_value = 10
max_value = 1000
range_width = max - min = 990
For predicate price > 500:
sel = (max_value - constant) / range_width
sel = (1000 - 500) / 990
sel ≈ 0.505 (50.5%)
For predicate price BETWEEN 200 AND 300:
sel = (high - low) / range_width
sel = (300 - 200) / 990
sel ≈ 0.101 (10.1%)
Edge Cases Handled by Min/Max:
Query: WHERE price < 5
Since 5 < min(10), sel = 0 (no rows can match)
Query: WHERE price > 2000
Since 2000 > max(1000), sel = 0 (no rows can match)
Query: WHERE price BETWEEN 0 AND 2000
Since range covers entire column, sel = 1.0
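The range logic, including the out-of-range edge cases, fits in one small function. This is an illustrative sketch under the uniform-distribution assumption, not any engine's actual estimator:

```python
def range_selectivity(lo, hi, col_min, col_max):
    """Selectivity of col BETWEEN lo AND hi, assuming values are spread
    uniformly between the column's observed min and max."""
    width = col_max - col_min
    if width <= 0:
        return 1.0                      # degenerate column: a single value
    lo = max(lo, col_min)               # clamp the query range to the
    hi = min(hi, col_max)               #   observed value range
    if hi <= lo:
        return 0.0                      # range lies entirely outside min/max
    return (hi - lo) / width

# price column with min = 10, max = 1000
print(range_selectivity(500, 1000, 10, 1000))  # ≈ 0.505  (price > 500)
print(range_selectivity(200, 300, 10, 1000))   # ≈ 0.101  (BETWEEN 200 AND 300)
print(range_selectivity(-100, 5, 10, 1000))    # 0.0      (price < 5)
print(range_selectivity(0, 2000, 10, 1000))    # 1.0      (covers entire range)
```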
```sql
-- PostgreSQL does not store min/max as separate statistics;
-- the first and last histogram bounds approximate them
SELECT
    attname,
    histogram_bounds,
    -- pg_stats arrays are anyarray; cast via text to subscript them
    (histogram_bounds::text::text[])[1] AS min_value,
    (histogram_bounds::text::text[])[
        array_length(histogram_bounds::text::text[], 1)] AS max_value
FROM pg_stats
WHERE tablename = 'orders' AND attname = 'order_date';

-- The raw statistics behind pg_stats live in pg_statistic:
SELECT
    attname,
    stakind1,    -- type of statistic in slot 1 (1 = MCV, 2 = histogram, 3 = correlation)
    stavalues1,  -- the values (e.g., the MCVs themselves)
    stanumbers1  -- the associated numbers (e.g., MCV frequencies)
FROM pg_statistic ps
JOIN pg_attribute pa ON ps.starelid = pa.attrelid AND ps.staattnum = pa.attnum
JOIN pg_class pc ON pa.attrelid = pc.oid
WHERE pc.relname = 'orders';
```

Min/max values are sensitive to outliers. If one erroneous price value is -999999 (a data entry error), the range calculation becomes wildly inaccurate: because the denominator (max - min) explodes, a price > 500 query would be estimated at a tiny fraction of a percent instead of the correct ~50%. Data quality directly impacts optimizer accuracy.
Correlation measures how closely the physical ordering of rows on disk matches the logical ordering of column values. This statistic ranges from -1 to +1: a value near +1 means rows are stored in nearly ascending column order, a value near -1 means nearly descending order, and a value near 0 means a row's physical position tells you nothing about its column value.
Correlation profoundly affects the cost of index range scans.
Why Correlation Matters for Index Scans:
Consider a range query SELECT * FROM orders WHERE order_date BETWEEN '2024-01-01' AND '2024-01-31':
High Correlation (correlation ≈ 1):
Disk pages: [Jan orders] [Feb orders] [Mar orders] ...
Index scan: Retrieves consecutive pages → Sequential I/O
I/O pattern: Read pages 1-10 sequentially
Cost: 10 sequential page reads = very cheap
Low Correlation (correlation ≈ 0):
Disk pages: [Mixed: Jan, Mar, Jun] [Mixed: Feb, Aug, Jan] ...
Index scan: Retrieves scattered pages → Random I/O
I/O pattern: Read pages 1, 47, 103, 12, 89, ...
Cost: 10 random page reads = 10-100x more expensive
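A simplified cost model shows how correlation can blend these two extremes. The sketch below loosely mirrors PostgreSQL's approach of interpolating between best-case (sequential) and worst-case (random) I/O using the square of the correlation; the page-cost constants are PostgreSQL's defaults, and everything else is a simplification:

```python
def index_scan_io_cost(pages_fetched, correlation,
                       seq_page_cost=1.0, random_page_cost=4.0):
    """Blend best-case (perfectly correlated, sequential I/O) and worst-case
    (uncorrelated, random I/O) costs using correlation squared."""
    best = pages_fetched * seq_page_cost      # correlation close to +/-1
    worst = pages_fetched * random_page_cost  # correlation close to 0
    c2 = correlation * correlation
    return worst + c2 * (best - worst)

print(index_scan_io_cost(1000, 0.998))  # ≈ 1012, nearly sequential cost
print(index_scan_io_cost(1000, 0.02))   # ≈ 3999, essentially all random I/O
```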
| Correlation | Index Scan Cost | Optimizer Likely Choice | Reason |
|---|---|---|---|
| > 0.9 | Low (sequential I/O) | Index range scan | Clustered data minimizes page fetches |
| 0.5 - 0.9 | Moderate | Depends on selectivity | Partial clustering still beneficial |
| -0.5 - 0.5 | High (random I/O) | Seq scan for large ranges | Random I/O may exceed seq scan cost |
| < -0.9 | Low (reversed sequential) | Index scan (reverse) | Predictable access pattern |
```sql
-- PostgreSQL explicitly stores correlation
SELECT
    attname AS column_name,
    correlation,
    CASE
        WHEN abs(correlation) > 0.9 THEN 'Highly correlated - index scan efficient'
        WHEN abs(correlation) > 0.5 THEN 'Moderately correlated'
        ELSE 'Low correlation - random I/O likely'
    END AS index_scan_efficiency
FROM pg_stats
WHERE tablename = 'orders'
ORDER BY abs(correlation) DESC;

-- Example output:
-- column_name  | correlation | index_scan_efficiency
-- -------------+-------------+---------------------------------
-- order_id     | 0.998       | Highly correlated - efficient
-- order_date   | 0.876       | Moderately correlated
-- customer_id  | 0.023       | Low correlation - random I/O
-- status       | -0.004      | Low correlation - random I/O
```

Correlation degrades over time as INSERT/UPDATE operations scatter new data throughout the table. CLUSTER (PostgreSQL) or rebuilding clustered indexes (SQL Server) restores correlation. For time-series data, consider BRIN indexes that exploit natural correlation without requiring physical reorganization.
The Most Common Values (MCV) list addresses the fundamental limitation of the 1/NDV selectivity assumption: real data is rarely uniformly distributed. MCVs store the most frequent values along with their actual frequencies, enabling precise selectivity for these common values.
MCV Structure:
Column: status
MCV List:
Value | Frequency
-------------+-----------
'pending' | 0.45 (45% of non-null rows)
'shipped' | 0.30 (30%)
'delivered' | 0.15 (15%)
'cancelled' | 0.08 (8%)
'returned' | 0.02 (2%)
Total MCV coverage: 100% (all values are MCVs)
Selectivity with MCVs:
Query: WHERE status = 'pending'
Check MCV list → Found: frequency = 0.45
sel = 0.45 (exact, from MCV)
Query: WHERE status = 'processing'
Check MCV list → Not found
Remaining fraction = 1 - sum(MCV frequencies) = 1 - 1.0 = 0
sel = 0 (value doesn't exist)
Without MCVs (using 1/NDV):
sel = 1/5 = 0.20
Error for 'pending': 0.20 estimated vs 0.45 actual (the estimate is less than half the true selectivity)
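Putting the MCV lookup and the uniform fallback together gives a compact estimator. This Python sketch mirrors the logic above (and the col = constant row of the formula table later on); it ignores NULLs for brevity:

```python
def eq_selectivity(value, mcv, ndv):
    """Selectivity of col = value, given an MCV dict (value -> frequency)
    and the column's number of distinct values."""
    if value in mcv:
        return mcv[value]                      # exact frequency from stats
    remaining_freq = 1.0 - sum(mcv.values())   # mass not covered by MCVs
    remaining_ndv = ndv - len(mcv)             # distinct values outside MCVs
    if remaining_ndv <= 0:
        return 0.0                             # every distinct value is an MCV
    return remaining_freq / remaining_ndv      # spread uniformly over the rest

status_mcv = {'pending': 0.45, 'shipped': 0.30, 'delivered': 0.15,
              'cancelled': 0.08, 'returned': 0.02}
print(eq_selectivity('pending', status_mcv, ndv=5))     # 0.45
print(eq_selectivity('processing', status_mcv, ndv=5))  # 0.0
```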
```sql
-- MCVs are in most_common_vals and most_common_freqs
SELECT
    attname AS column_name,
    most_common_vals,
    most_common_freqs,
    n_distinct
FROM pg_stats
WHERE tablename = 'orders' AND attname = 'status';

-- Example output (reformatted):
-- column_name:        status
-- most_common_vals:   {pending,shipped,delivered,cancelled,returned}
-- most_common_freqs:  {0.45,0.30,0.15,0.08,0.02}
-- n_distinct:         5

-- Unpacked view using unnest (pg_stats arrays are anyarray, so cast via text):
SELECT
    v.value,
    f.frequency,
    ROUND((f.frequency * 100)::numeric, 1) || '%' AS percentage
FROM pg_stats,
LATERAL unnest(most_common_vals::text::text[]) WITH ORDINALITY AS v(value, ord),
LATERAL unnest(most_common_freqs) WITH ORDINALITY AS f(frequency, ord)
WHERE tablename = 'orders'
  AND attname = 'status'
  AND v.ord = f.ord
ORDER BY f.frequency DESC;
```

Databases limit the number of MCVs stored (typically 100-200). For high-cardinality columns like email addresses, MCVs provide no benefit. MCVs shine for categorical columns (status, country, category) with few distinct values but skewed distributions.
Let's consolidate the selectivity estimation formulas that use column statistics. These formulas are the mathematical foundation of predicate cost estimation.
| Predicate | Formula | Statistics Used |
|---|---|---|
| col = constant | MCV[val] if exists, else (1 - MCV_sum) / (NDV - MCV_count) | MCV, NDV |
| col <> constant | 1 - sel(col = constant) | MCV, NDV |
| col IS NULL | null_fraction | Null fraction |
| col IS NOT NULL | 1 - null_fraction | Null fraction |
| col < constant | (constant - min) / (max - min) | Min, Max (or histogram) |
| col > constant | (max - constant) / (max - min) | Min, Max (or histogram) |
| col BETWEEN a AND b | (b - a) / (max - min) | Min, Max (or histogram) |
| col LIKE 'prefix%' | 1 / NDV (approximation) or histogram bucket | NDV, histogram |
| col1 = col2 (join) | 1 / MAX(NDV(col1), NDV(col2)) | NDV of both columns |
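The join row of the table deserves a brief illustration, since it is the basis for join cardinality estimates. The sketch below is the classic textbook estimate used when only the two NDVs are known; the customers table figures are hypothetical, while the orders figures follow the earlier example:

```python
def join_selectivity(ndv_left, ndv_right):
    """Equi-join selectivity: assume the side with fewer distinct values is
    contained in the other, so a row pair matches with probability
    1 / max(NDV_left, NDV_right)."""
    return 1.0 / max(ndv_left, ndv_right)

# orders.customer_id (1,000,000 distinct) joined to a hypothetical
# customers.id column with 1,200,000 distinct values
rows_orders, rows_customers = 10_000_000, 1_200_000
sel = join_selectivity(1_000_000, 1_200_000)
print(rows_orders * rows_customers * sel)  # ≈ 10,000,000 matching pairs
```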
Compound Predicate Selectivity:
For compound predicates, optimizers assume independence between conditions:
sel(A AND B) = sel(A) × sel(B)
sel(A OR B) = sel(A) + sel(B) - sel(A) × sel(B)
sel(NOT A) = 1 - sel(A)
Example:
WHERE status = 'pending' AND order_date > '2024-01-01'
sel(status = 'pending') = 0.45 (from MCV)
sel(order_date > '2024-01-01') = 0.25 (from range calculation)
sel(combined) = 0.45 × 0.25 = 0.1125 (11.25%)
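As code, the independence-based combination rules are one-liners, which is part of why they are used so widely despite their known weakness. A minimal sketch with hypothetical function names:

```python
def and_sel(a, b):
    """sel(A AND B), assuming A and B are independent."""
    return a * b

def or_sel(a, b):
    """sel(A OR B) by inclusion-exclusion, still assuming independence."""
    return a + b - a * b

sel_status = 0.45  # status = 'pending', from the MCV list
sel_date = 0.25    # order_date > '2024-01-01', from the range estimate
print(and_sel(sel_status, sel_date))  # ≈ 0.1125, about 11.25% of rows
print(or_sel(sel_status, sel_date))   # ≈ 0.5875
```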
Independence is often false! If 'pending' orders are mostly from the last month (correlated with order_date), the true selectivity might be 30%, not 11%. Some systems support multi-column statistics to capture correlations, but most rely on the independence assumption and accept estimation error.
Column statistics transform raw metadata into actionable selectivity estimates. They're the mechanism by which the optimizer predicts filter results without executing queries.
What's next:
Column statistics work well for low-cardinality columns and exact value matches, but what about high-cardinality numeric columns with complex distributions? The next page explores histograms—the sophisticated statistical structures that capture full value distributions for precise range selectivity estimation.
You now understand how column-level statistics enable selectivity estimation. You can read statistics from system catalogs, understand their role in cost estimation, and recognize the limitations of uniform distribution assumptions. Next: histograms for capturing complex distributions.