All the statistics we've collected—table sizes, column distributions, histograms, correlations—converge on a single critical output: cardinality estimation. This is the optimizer's prediction of how many rows will flow through each node of the query plan.
Get the cardinality right, and the optimizer makes brilliant decisions. Get it wrong, and even a sophisticated optimizer produces disastrous plans. A query expected to return 100 rows might actually return 1 million. The optimizer chose a nested loop join (perfect for 100 rows, catastrophic for 1 million), and the query runs 10,000x slower than optimal.
By the end of this page, you will understand how cardinality estimation works, the sources of estimation error, how errors compound through join operations, techniques for diagnosing and mitigating estimation problems, and the fundamental limits of row estimation.
Cardinality in query optimization means the number of rows produced by an operator or flowing through a node in the query plan. Cardinality estimation is the process of predicting these counts without executing the query.
How Cardinality Estimation Works:
Input: Base table statistics + Query predicates
Output: Estimated row count at each plan node
Process:
1. Start with base table cardinality (from statistics)
2. Apply selectivity factors for each predicate:
estimated_rows = base_rows × selectivity
3. Propagate estimates up through the plan tree
4. Use join selectivity formulas for joining tables
Example:
SELECT * FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
WHERE o.status = 'pending'
AND o.order_date > '2024-01-01'
AND c.country = 'US';
Base tables:
orders: 10,000,000 rows
customers: 1,000,000 rows
Estimation steps:
1. orders after status filter:
sel(status='pending') = 0.20 (from MCV)
10,000,000 × 0.20 = 2,000,000 rows
2. orders after date filter:
sel(date > '2024-01-01') = 0.30 (from histogram)
Combined: 2,000,000 × 0.30 = 600,000 rows
(assumes independence)
3. customers after country filter:
sel(country='US') = 0.40 (from MCV)
1,000,000 × 0.40 = 400,000 rows
4. Join cardinality:
join_sel = 1 / max(NDV(o.customer_id), NDV(c.customer_id))
join_sel = 1 / 1,000,000 = 0.000001
600,000 × 400,000 × 0.000001 = 240,000 result rows
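To see where these numbers come from on a live PostgreSQL system, the per-column NDV that drives the join-selectivity formula is exposed in pg_stats; the sketch below assumes the orders and customers tables above (a negative n_distinct encodes a fraction of the row count rather than an absolute count), and the second query is just the step-4 arithmetic as a runnable check.

SELECT tablename, attname, n_distinct
FROM pg_stats
WHERE (tablename, attname) IN (('orders', 'customer_id'),
                               ('customers', 'customer_id'));

SELECT 600000::numeric * 400000 * (1.0 / 1000000) AS estimated_join_rows;  -- 240,000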
| Operator | Cardinality Formula | Key Dependencies |
|---|---|---|
| Table Scan | table_rows (from statistics) | Accurate row count |
| Selection (σ) | input_rows × selectivity | Predicate selectivity |
| Projection (π) | input_rows (unchanged) | None |
| Inner Join (⋈) | (rows_A × rows_B) × join_selectivity | Join key NDV |
| Left Outer Join | rows_A + (rows_A × rows_B × sel) - matched | Match rate estimation |
| GROUP BY | NDV of grouping columns | Multi-column NDV |
| DISTINCT | NDV of result columns | Multi-column NDV |
| UNION | rows_A + rows_B (may adjust for overlap) | Optional overlap estimation |
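As a concrete instance of the GROUP BY row above, you can compare the optimizer's aggregate estimate with the stored NDV directly; a small sketch, assuming PostgreSQL and the customers table from the earlier example.

-- The estimated row count on the HashAggregate/GroupAggregate node
-- should track the distinct-value estimate for the grouping column.
EXPLAIN SELECT country, COUNT(*) FROM customers GROUP BY country;

SELECT n_distinct
FROM pg_stats
WHERE tablename = 'customers' AND attname = 'country';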
Every cost calculation depends on cardinality estimates. Join order, access method selection, memory allocation, parallelization decisions—all flow from cardinality. A 10x cardinality error can cause a 100x or 1000x performance error through cascading decision impacts.
Cardinality estimation is fundamentally approximate. Many sources of error are inherent to the statistical approach.
Common sources of error include:
- The independence assumption: the optimizer estimates P(A AND B) = P(A) × P(B). Correlated columns violate this assumption dramatically.
- Function-wrapped columns: WHERE YEAR(date_col) = 2024 can't use date_col statistics, so the optimizer often assumes a default 1% selectivity.
- Opaque predicates: WHERE col LIKE '%pattern%' or user-defined functions have unknown selectivity.

The Independence Problem in Detail:
-- Query with correlated columns
SELECT * FROM employees
WHERE department = 'Engineering'
AND skill = 'Python';
-- True correlation: 90% of Python developers are in Engineering
Optimizer's calculation (independence):
P(Engineering) = 0.30
P(Python) = 0.20
P(Engineering AND Python) = 0.30 × 0.20 = 0.06 (6%)
Actual probability: 0.18 (18%) — 3x underestimate!
-- The optimizer underestimates, potentially choosing:
-- - Index scan instead of table scan
-- - Nested loop instead of hash join
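One way to confirm this kind of correlation before blaming the optimizer is to measure it directly; a quick sketch against the hypothetical employees table, comparing the observed combined selectivity with the product the independence assumption would use.

SELECT
    AVG((department = 'Engineering')::int)                       AS sel_department,
    AVG((skill = 'Python')::int)                                 AS sel_skill,
    AVG((department = 'Engineering' AND skill = 'Python')::int)  AS sel_actual,
    AVG((department = 'Engineering')::int)
      * AVG((skill = 'Python')::int)                             AS sel_independence
FROM employees;
-- sel_actual much larger than sel_independence is the signature of correlated columns.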
When a predicate wraps a column in a function, statistics become useless. WHERE UPPER(name) = 'JOHN' doesn't match the 'name' histogram. Many optimizers default to magic numbers like 1%, 5%, or 10% selectivity. These guesses are correct only by accident.
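You can watch the fallback happen with two EXPLAINs; a sketch against the same hypothetical employees table (the exact default percentage varies by engine and version).

EXPLAIN SELECT * FROM employees WHERE name = 'John';          -- estimate driven by MCVs/histogram on name
EXPLAIN SELECT * FROM employees WHERE UPPER(name) = 'JOHN';   -- estimate falls back to a generic default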
Individual estimation errors are problematic; error propagation through joins is catastrophic. Join cardinality estimation multiplies errors, often exponentially.
The Join Explosion:
Consider a 4-way join:
A ⋈ B ⋈ C ⋈ D
True cardinalities:
|A| = 100,000
|A ⋈ B| = 10,000
|A ⋈ B ⋈ C| = 5,000
|A ⋈ B ⋈ C ⋈ D| = 1,000
With 2x overestimate at each join:
|A ⋈ B|_est = 20,000 (2x off)
|A ⋈ B ⋈ C|_est = 20,000 (4x off)
|A ⋈ B ⋈ C ⋈ D|_est = 8,000 (8x off)
With 3x overestimate at each join:
|A ⋈ B|_est = 30,000 (3x off)
|A ⋈ B ⋈ C|_est = 45,000 (9x off)
|A ⋈ B ⋈ C ⋈ D|_est = 27,000 (27x off!)
With n joins and an average per-join error factor of e, the final cardinality error can grow like e^n (see the table below and the quick check after it). A 2-table query with a 3x error stays 3x wrong; a 6-table query, with five joins, can end up 243x wrong.
| Joins | 2x Error/Join | 3x Error/Join | 5x Error/Join |
|---|---|---|---|
| 1 | 2x | 3x | 5x |
| 2 | 4x | 9x | 25x |
| 3 | 8x | 27x | 125x |
| 4 | 16x | 81x | 625x |
| 5 | 32x | 243x | 3,125x |
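The table is simply the error factor raised to the number of joins; a quick way to reproduce it, assuming PostgreSQL for generate_series and the ^ operator.

SELECT n AS joins,
       (2.0 ^ n)::int AS err_2x,
       (3.0 ^ n)::int AS err_3x,
       (5.0 ^ n)::int AS err_5x
FROM generate_series(1, 5) AS n;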
Why Join Order Matters More Than You Think:
Given severe underestimate for A ⋈ B:
True: 1,000,000 rows
Estimated: 1,000 rows
Optimizer's plan (based on estimate):
Build tiny hash table for A ⋈ B (1,000 rows fits in 64KB)
Stream C through it
What actually happens:
Hash table needs 1,000,000 rows → 64MB
Exceeds memory → spills to disk
Performance: 100x slower than estimated
Better plan (if estimate was correct):
Build hash on C (smaller table)
Stream A ⋈ B through it
Much better memory utilization
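You can reproduce this spill behavior on a small scale in PostgreSQL; a sketch with hypothetical tables and a deliberately tiny work_mem.

SET work_mem = '64kB';            -- force even a modest hash table to spill
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM fact f                        -- hypothetical tables
JOIN dim d ON d.id = f.dim_id;
-- On the Hash node, "Batches: N" with N > 1 means the build side did not
-- fit in work_mem and spilled to disk.
RESET work_mem;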
The optimizer chooses plans based on estimates it believes to be correct. When estimates are wrong, it's not making 'bad' decisions—it's making optimal decisions for the wrong problem. This is why cardinality accuracy is existentially important.
When queries perform poorly, cardinality misestimation is often the culprit. Here's how to diagnose it systematically.
-- Step 1: Get estimated vs. actual cardinalities
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT * FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
WHERE o.status = 'pending';

-- Look for discrepancies like:
-- Hash Join (cost=... rows=1000) (actual rows=150000)
--                         ↑ This is the problem! 150x underestimate

-- Step 2: Examine per-operator estimates
EXPLAIN (ANALYZE, VERBOSE, FORMAT JSON)  -- JSON for programmatic analysis
SELECT ...;

-- Step 3: Check statistics on suspicious columns
SELECT attname, n_distinct, correlation,
       most_common_vals, most_common_freqs, histogram_bounds
FROM pg_stats
WHERE tablename = 'orders' AND attname = 'status';

-- Step 4: Test selectivity in isolation
EXPLAIN SELECT * FROM orders WHERE status = 'pending';
-- Compare estimated rows to:
SELECT COUNT(*) FROM orders WHERE status = 'pending';

-- Step 5: Check for missing extended statistics
SELECT * FROM pg_statistic_ext WHERE stxrelid = 'orders'::regclass;

-- Step 6: Create extended statistics if correlation suspected
CREATE STATISTICS ext_orders_status_date (dependencies)
ON status, order_date FROM orders;
ANALYZE orders;
-- Re-run EXPLAIN ANALYZE to see if the estimate improved

If actual rows differ from estimates by more than 10x at any operator, you have a cardinality problem worth investigating. Errors of less than 10x rarely affect plan quality significantly due to cost model robustness.
Once you've diagnosed estimation problems, several mitigation strategies can improve accuracy.
1. Refresh Stale Statistics
-- The first thing to try: fresh ANALYZE
ANALYZE affected_table;
-- With higher precision if needed
ALTER TABLE orders ALTER COLUMN status SET STATISTICS 1000;
ANALYZE orders;
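Before re-analyzing, it is worth checking how stale the statistics actually are; a PostgreSQL sketch, with the table name assumed.

SELECT relname,
       n_live_tup,
       n_mod_since_analyze,   -- rows modified since the last ANALYZE
       last_analyze,
       last_autoanalyze
FROM pg_stat_user_tables
WHERE relname = 'orders';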
2. Create Extended/Multi-Column Statistics
-- PostgreSQL: Correlation-aware statistics
CREATE STATISTICS ext_status_date (dependencies)
ON status, order_date FROM orders;
-- SQL Server: Multi-column statistics
CREATE STATISTICS stat_status_date
ON orders (status, order_date)
WITH FULLSCAN;
-- Oracle: Column groups
SELECT DBMS_STATS.CREATE_EXTENDED_STATS(
'SALES', 'ORDERS', '(STATUS, ORDER_DATE)'
) FROM DUAL;
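In PostgreSQL you can confirm that the dependency statistics were actually built, and how strong the detected dependencies are, via the pg_stats_ext view (available from version 12).

SELECT statistics_name, attnames, dependencies
FROM pg_stats_ext
WHERE tablename = 'orders';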
3. Create Expression Statistics
-- PostgreSQL 14+: Statistics on expressions
CREATE STATISTICS expr_year
ON (EXTRACT(YEAR FROM order_date))
FROM orders;
ANALYZE orders;
-- Now WHERE EXTRACT(YEAR FROM order_date) = 2024
-- uses proper histogram instead of default 1%
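On versions before PostgreSQL 14, an expression index gives a similar effect, because ANALYZE also gathers statistics on indexed expressions; a sketch, assuming order_date is a plain date or timestamp column (indexed expressions must be immutable).

CREATE INDEX idx_orders_order_year
    ON orders ((EXTRACT(YEAR FROM order_date)));
ANALYZE orders;
-- ANALYZE now collects statistics for the indexed expression, so the
-- EXTRACT(YEAR FROM order_date) = 2024 predicate gets a real estimate.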
Cardinality hints and forced plans should be used sparingly. They create maintenance burden, don't adapt to data changes, and mask underlying problems. Always prefer fixing statistics first.
Modern databases incorporate adaptive execution features that can correct cardinality errors during query execution, not just during planning.
SQL Server Adaptive Query Processing (2017+):
1. Memory Grant Feedback
- Hash joins that spill to tempdb remember the spill
- Next execution gets more memory automatically
- Gradually converges to optimal memory grant
2. Adaptive Joins
- Plan includes BOTH nested loop and hash join choices
- At runtime, chooses based on actual row count
- If <threshold rows: nested loop; else: hash join
3. Interleaved Execution
- Multi-statement table-valued functions
- Execute TVF first, use actual cardinality for rest of plan
Oracle Adaptive Plans (12c+):
1. Adaptive Plan Selection
- Plan compiled with "statistics collector" operators
- During first execution, actual cardinality is observed
- Plan switches sub-plans based on observed cardinality
2. SQL Plan Directives
- When estimates are wrong, directive is created
- Next execution uses dynamic sampling for that table/predicate
- Automatically improves estimates without manual intervention
3. Auto-Created Extended Statistics
- Optimizer tracks which column groups have correlated estimates
- Automatically creates column group statistics on problematic groups
PostgreSQL Adaptive Mechanisms:
PostgreSQL takes a more conservative approach:
1. No runtime plan switching
- Plan is fixed at compile time
- Philosophy: predictable execution over adaptive optimization
2. JIT Compilation Adapts to Row Count
- Large result sets trigger JIT compilation
- Based on actual execution, not estimates
3. Parallel Query Adaptation
- Workers can terminate early if estimates were wrong
- But initial worker count is based on estimates
General guidance for PostgreSQL:
- Focus on statistics accuracy upfront
- Use extended statistics for correlations
- Consider pg_hint_plan extension for problem queries
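If you do reach for pg_hint_plan, its Rows hint corrects a join estimate directly; a sketch, assuming the extension is installed and loaded, using the figures from the earlier example.

/*+ Rows(o c #240000) */
SELECT *
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
WHERE o.status = 'pending'
  AND o.order_date > '2024-01-01'
  AND c.country = 'US';
-- "#240000" pins the o/c join estimate to 240,000 rows; the *, + and -
-- forms adjust the planner's own estimate instead of replacing it.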
Adaptive query execution represents a paradigm shift: acknowledging that estimates will be wrong and building plans that can recover. New databases and versions increasingly incorporate these features. Stay current with your database's adaptive capabilities.
Despite our best efforts, cardinality estimation has fundamental theoretical limits that cannot be overcome with more statistics or better algorithms.
The Parameter Sensitivity Problem:
-- Same query, vastly different selectivity
SELECT * FROM orders WHERE status = @status;
@status = 'pending' → 40% of table (4M rows)
@status = 'archived' → 0.01% of table (1,000 rows)
Optimal plan for 'pending': Sequential scan + hash join
Optimal plan for 'archived': Index scan + nested loop join
The optimizer compiles ONE plan. Which parameter does it optimize for?
Solutions:
1. Parameter sniffing: Use first executed value (SQL Server, Oracle)
→ Works great until a different value runs and reuses the plan
2. Average selectivity: Estimate for 'typical' value
→ Often wrong for both extremes
3. Multiple plans: Compile different plans for different parameter ranges
→ Few databases support this well
4. OPTION (RECOMPILE): Compile fresh plan each time
→ Works but adds compilation overhead
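PostgreSQL's closest equivalent to option 4 for prepared statements is to force custom plans, so every execution is planned with the actual parameter value (plan_cache_mode is available from version 12).

SET plan_cache_mode = force_custom_plan;

PREPARE orders_by_status(text) AS
    SELECT * FROM orders WHERE status = $1;

EXPLAIN EXECUTE orders_by_status('archived');  -- planned for the rare value
EXPLAIN EXECUTE orders_by_status('pending');   -- planned for the common value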
Perfect estimation is impossible. The goal is 'good enough' estimation—close enough that the optimizer picks reasonable plans. A plan within 2x of optimal is usually acceptable. Only catastrophic misestimates (10x+) require intervention.
Cardinality estimation is where statistics meet decision-making. Every optimization choice—join order, access method, memory allocation—depends on the optimizer's row count predictions. Accurate estimation enables brilliant query plans; inaccurate estimation causes disasters.
Module Complete:
You've now explored the complete statistical foundation of query optimization: table-level statistics, column distributions and histograms, correlation handling, and cardinality estimation.
This knowledge forms the diagnostic foundation for query performance troubleshooting. When a query is slow, you now know where to look first: Are the stats fresh? Are the estimates accurate? What's causing the discrepancy?
Congratulations! You've mastered Statistics and Histograms—the empirical foundation of cost-based query optimization. You can now diagnose cardinality problems, maintain statistics effectively, and understand both the power and limitations of estimation. This knowledge directly translates to solving real-world query performance problems.