If you could master only one aspect of query optimization, it should be cardinality estimation. Every decision the optimizer makes—join algorithms, access methods, parallelism, memory allocation—depends critically on predicting how many rows each operation will process and produce.
Cardinality estimation answers the question: "How many rows will this operation return?" For a simple filter, this means predicting how many rows satisfy the WHERE clause. For a join, it means predicting how many pairs will match the join condition. For a GROUP BY, it means predicting how many distinct groups will form.
When cardinality estimation is accurate, the optimizer makes brilliant choices. When it's wrong—and it often is—the results can be catastrophic. A 100× overestimate might cause allocation of vast amounts of memory that won't be used. A 100× underestimate might trigger a nested loop join that takes hours instead of the hash join that would take seconds.
This page develops deep expertise in cardinality estimation: how it works, why it fails, and how to fix it.
By the end of this page, you will understand how selectivity factors combine to estimate filter cardinality, join cardinality estimation formulas and their assumptions, the major sources of estimation error and how they compound, modern approaches including adaptive estimation and machine learning, and practical diagnosis and resolution of cardinality problems.
Selectivity is the fraction of rows that satisfy a predicate. It ranges from 0.0 (no rows match) to 1.0 (all rows match). Cardinality estimation for filters is:
Output cardinality = Input cardinality × Selectivity
For a table with 1,000,000 rows and a predicate with selectivity 0.05 (5%), the estimated output is 1,000,000 × 0.05 = 50,000 rows.
The challenge is determining selectivity accurately. Different predicate types use different estimation techniques:
| Predicate Type | Example | Estimation Method |
|---|---|---|
| Equality (indexed) | id = 42 | 1/n_distinct if uniform; MCV check first |
| Equality (MCV match) | status = 'active' | Lookup frequency in MCV list |
| Equality (non-MCV) | city = 'Smallville' | (1 - sum(mcv_freqs)) / (n_distinct - mcv_count) |
| NULL check | IS NULL / IS NOT NULL | null_frac or (1 - null_frac) |
| Range (numeric) | age BETWEEN 20 AND 30 | Histogram bucket interpolation |
| Range (open-ended) | price > 1000 | Sum histogram bucket fractions beyond bound |
| LIKE (prefix) | LIKE 'John%' | Histogram range on prefix bounds |
| LIKE (pattern) | LIKE '%@gmail%' | Default selectivity (~0.005 to 0.02) |
| IN list | IN (1, 2, 3) | Sum of individual equality selectivities |
| NOT condition | status != 'deleted' | 1 - selectivity(status = 'deleted') |
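These per-column inputs live in the statistics catalog. As a quick sketch (assuming the hypothetical `orders` table used throughout this page, and that it has been ANALYZEd), PostgreSQL exposes them through the `pg_stats` view:

```sql
-- Inspect the raw inputs the planner uses for the orders.status column
-- (hypothetical table; run ANALYZE first so statistics exist)
SELECT null_frac,            -- fraction of NULLs
       n_distinct,           -- distinct values (negative = fraction of row count)
       most_common_vals,     -- MCV list
       most_common_freqs,    -- matching MCV frequencies
       histogram_bounds      -- equal-height histogram over non-MCV values
FROM   pg_stats
WHERE  schemaname = 'public'
  AND  tablename  = 'orders'
  AND  attname    = 'status';
```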
```
=============================================================
SELECTIVITY CALCULATION WALKTHROUGH
=============================================================

Table: orders (1,000,000 rows)
Column: status (MCV list: {pending: 0.35, processing: 0.25, shipped: 0.20,
                           delivered: 0.15, cancelled: 0.05})

Case 1: status = 'pending' (MCV match)
  Selectivity = 0.35 (directly from MCV frequency)
  Estimated rows = 1,000,000 × 0.35 = 350,000

Case 2: status = 'archived' (not in MCV list)
  MCV coverage: 0.35 + 0.25 + 0.20 + 0.15 + 0.05 = 1.00 (all values!)
  No non-MCV values exist
  Selectivity = 0.0 (or a near-zero default)
  This correctly identifies 'archived' as non-existent

Case 3: status != 'cancelled'
  Selectivity = 1 - 0.05 = 0.95
  Estimated rows = 1,000,000 × 0.95 = 950,000

=============================================================
HISTOGRAM-BASED RANGE ESTIMATION
=============================================================

Column: total_amount
Histogram: 100 equal-height buckets spanning [0, 10000]
Each bucket represents 1% of the non-MCV rows
MCV values excluded: ~5% of rows are in MCVs

Query: WHERE total_amount BETWEEN 500 AND 2000

Lower bound 500 falls inside bucket 5  (~50% of that bucket qualifies)
Upper bound 2000 falls inside bucket 20 (~20% of that bucket qualifies)
Full buckets covered: 6, 7, ..., 19 = 14 buckets

Non-MCV selectivity: (14 + 0.5 + 0.2) × 0.01 = 0.147 (14.7%)
Apply to the non-MCV fraction: 0.147 × 0.95 ≈ 0.14
Add any MCVs in range: assume 2% of rows fall in this range
Final selectivity: 0.14 + 0.02 = 0.16

Estimated rows: 1,000,000 × 0.16 = 160,000
```

When the optimizer can't compute selectivity (complex expressions, missing statistics, UDF predicates), it falls back to defaults: roughly 0.33 for range predicates, 0.005 for equality, and 0.25 for inequality. These guesses are often wildly wrong, causing significant estimation errors for complex queries.
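The formulas from the walkthrough and the table above can be reproduced by hand from the catalog. A minimal sketch (assuming a hypothetical `orders.city` column with populated statistics) computes the non-MCV equality selectivity directly:

```sql
-- Non-MCV equality selectivity: (1 - sum(mcv_freqs)) / (n_distinct - mcv_count)
-- Hypothetical column orders.city; requires a prior ANALYZE
SELECT (1 - coalesce((SELECT sum(f) FROM unnest(most_common_freqs) AS f), 0))
       / NULLIF(n_distinct - coalesce(array_length(most_common_freqs, 1), 0), 0)
         AS non_mcv_equality_selectivity   -- NULL when MCVs cover every distinct value
FROM   pg_stats
WHERE  tablename = 'orders'
  AND  attname   = 'city'
  AND  n_distinct > 0;   -- formula as written assumes a positive distinct count
```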
Real queries have multiple predicates combined with AND, OR, and NOT. The optimizer must combine individual selectivities to estimate the overall filter selectivity.
```
=============================================================
COMBINING SELECTIVITIES (Assuming Independence)
=============================================================

Given predicates P1, P2 with selectivities s1, s2:

AND (Conjunction):
  Selectivity(P1 AND P2) = s1 × s2
  Example: status = 'active' AND country = 'USA'
    s1 = 0.4, s2 = 0.3
    Combined = 0.4 × 0.3 = 0.12 (12%)

OR (Disjunction):
  Selectivity(P1 OR P2) = s1 + s2 - (s1 × s2)
  Example: status = 'pending' OR status = 'processing'
    s1 = 0.35, s2 = 0.25
    Combined = 0.35 + 0.25 - (0.35 × 0.25) = 0.5125
    (These two values are mutually exclusive, so the true selectivity is 0.60;
     the independence formula underestimates here.)

NOT (Negation):
  Selectivity(NOT P1) = 1 - s1
  Example: NOT (status = 'deleted')
    s1 = 0.05
    Combined = 1 - 0.05 = 0.95

=============================================================
MULTI-PREDICATE EXAMPLE
=============================================================

Query: WHERE status = 'active' AND total > 100 AND created_date > '2024-01-01'

Individual selectivities (from statistics):
  status = 'active':            0.40
  total > 100:                  0.70
  created_date > '2024-01-01':  0.25

Combined (assuming independence): 0.40 × 0.70 × 0.25 = 0.07 (7%)

For 1,000,000 rows: Estimated output = 70,000 rows
```
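You can watch the independence assumption at work by comparing single-predicate and multi-predicate estimates. A sketch, assuming the same hypothetical `orders` table:

```sql
-- Estimated rows for each predicate alone, then combined
-- (compare the "rows=" figures in the three plan outputs)
EXPLAIN SELECT * FROM orders WHERE status = 'active';
EXPLAIN SELECT * FROM orders WHERE total > 100;
EXPLAIN SELECT * FROM orders WHERE status = 'active' AND total > 100;
-- Under independence, the third estimate ≈ (first estimate × second estimate) / total row count
```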
The Independence Assumption Problem:
The multiplication rule for AND predicates assumes columns are statistically independent: knowing the value of one column tells you nothing about another. This is often false:
Columns: state, city
Actual relationship: city determines state (functional dependency)
Predicate: state = 'California' AND city = 'Los Angeles'
Independence assumption:
P(state=CA) = 0.12, P(city=LA) = 0.04
Estimated: 0.12 × 0.04 = 0.0048 (0.48%)
Reality:
All 'Los Angeles' rows have state = 'California'
Actual: 0.04 (4%)
Underestimate: 8× error
This error cascades through joins and aggregations, potentially causing 100× or greater final estimation errors.
Mitigation Strategies:
Some optimizers impose minimum selectivity floors or dampen repeated multiplication so that combined AND selectivity can't collapse toward zero. However, these are heuristics, not solutions; the real fix is extended statistics on correlated columns, as sketched below.
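A minimal sketch of that fix in PostgreSQL (assuming a hypothetical `customers` table with `state` and `city` columns): functional-dependency statistics let the planner learn that `city` implies `state`.

```sql
-- Extended statistics capturing the city -> state functional dependency
-- (hypothetical customers table; PostgreSQL 10+)
CREATE STATISTICS customers_state_city (dependencies)
    ON state, city FROM customers;

ANALYZE customers;  -- populate the new statistics object

-- The combined estimate for state = 'California' AND city = 'Los Angeles'
-- should now track the city selectivity instead of the 8x-too-low product
EXPLAIN SELECT * FROM customers
WHERE state = 'California' AND city = 'Los Angeles';
```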
Join cardinality estimation is arguably the most challenging and impactful aspect of query optimization. The output of a join feeds into subsequent operations, so errors compound dramatically through the query plan.
The Basic Join Cardinality Formula:
|R ⋈ S| = |R| × |S| × Selectivity_join
For an equi-join on a common key:
Selectivity_join = 1 / MAX(n_distinct(R.key), n_distinct(S.key))
Rationale: Suppose R.key has 100 distinct values and S.key has 50 distinct values, and assume the 50 values in S all appear among the 100 values in R (the containment assumption). The |R| rows are spread over 100 distinct keys, so each key value appears in roughly |R|/100 rows of R. Each row of S therefore matches about |R|/100 rows of R, and summing over all |S| rows gives the expected output:

|R| × |S| / MAX(n_distinct(R.key), n_distinct(S.key))
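As a quick sketch (assuming the hypothetical `orders` and `customers` tables with fresh statistics), you can pull the inputs to this formula from `pg_stats` and compare your hand computation against the planner's own estimate:

```sql
-- n_distinct for both join columns (negative values mean a fraction of the row count)
SELECT tablename, attname, n_distinct
FROM   pg_stats
WHERE  (tablename, attname) IN (('orders', 'customer_id'), ('customers', 'id'));

-- Plug the values into the formula by hand:
--   estimated_join_rows = |orders| × |customers| / MAX(ndv_orders, ndv_customers)
-- then compare against the planner's estimate for the join:
EXPLAIN SELECT * FROM orders o JOIN customers c ON o.customer_id = c.id;
```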
```
=============================================================
JOIN CARDINALITY ESTIMATION EXAMPLES
=============================================================

=== CASE 1: Primary Key to Foreign Key Join ===
orders:      1,000,000 rows, order_id is PK  (n_distinct = 1,000,000)
order_items: 5,000,000 rows, order_id is FK  (n_distinct = 1,000,000)

|orders ⋈ order_items| = 1,000,000 × 5,000,000 / 1,000,000
                       = 5,000,000 rows

Interpretation: each order has ~5 items on average, correct!

=== CASE 2: Many-to-Many Join ===
products:   50,000 rows, category_id (n_distinct = 100)
categories: 100 rows, id is PK (n_distinct = 100)

|products ⋈ categories| = 50,000 × 100 / MAX(100, 100)
                        = 50,000 rows

Each product matches exactly one category, correct!

=== CASE 3: Filtered Join ===
SELECT * FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE c.country = 'USA'

Step 1: Apply the filter to customers
  customers: 100,000 rows × 0.30 (USA) = 30,000 rows

Step 2: Join filtered customers with orders
  orders: 1,000,000 rows, customer_id n_distinct = 80,000
  filtered customers: 30,000 rows, id n_distinct = 30,000

  |orders ⋈ filtered_customers| = 1,000,000 × 30,000 / MAX(80,000, 30,000)
                                = 1,000,000 × 30,000 / 80,000
                                = 375,000 rows

=== CASE 4: Problematic Data Skew ===
orders: 1,000,000 rows, customer_id (n_distinct = 80,000)
Statistics suggest a uniform distribution, but:
  the top 100 customers account for 50% of orders!

Query: JOIN for one top customer with 50,000 orders
Estimated: 1,000,000 / 80,000 = 12.5 orders per customer
Actual:    50,000 orders for this customer

Error: 4,000× underestimate for this specific join!
```

When the optimizer knows about foreign key relationships (through constraints), it can produce tighter cardinality bounds. A FK from order_items.order_id to orders.id guarantees the join produces at most |order_items| rows. Modern optimizers exploit these semantic constraints when available.
Cardinality estimation errors don't stay contained—they propagate and amplify through the query plan. Understanding this propagation is essential for diagnosing complex performance problems.
The Multiplicative Error Effect:
Error in A: 10×
Error in B: 10×
Error in A⋈B: Can be 10 × 10 = 100× or more!
Why can the combined error exceed even that product?
- Join selectivity also estimated incorrectly
- Correlation between join columns not captured
- Skewed distributions amplify effects
Real-World Impact:
| Estimation Error | Typical Consequence |
|---|---|
| 2-5× | Usually acceptable; optimizer makes reasonable choices |
| 5-10× | May cause suboptimal algorithm selection; 2-3× slowdown |
| 10-100× | Likely to cause wrong join algorithm; 10× or worse slowdown |
| 100-1000× | Almost certainly wrong plan; query may be unusable |
| 1000×+ | Catastrophic; timeouts, resource exhaustion, system impact |
The Optimizer's Dilemma:
The optimizer must commit to a plan before execution. If it estimates 1,000 rows and allocates memory for a hash table holding 1,000 rows, but 1,000,000 rows actually flow through, the hash table overflows its memory budget and spills to disk, and a plan built around that "small" input degrades by orders of magnitude. In a traditional optimizer there is no opportunity to revisit the choice mid-query.
This is why cardinality accuracy matters more than cost model precision.
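One concrete way to spot this after the fact (a sketch, assuming PostgreSQL and the hypothetical tables used above): a hash join that was sized for far fewer rows than it received shows up as multiple batches in EXPLAIN ANALYZE output.

```sql
-- Look for hash joins whose build side was underestimated and spilled
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM   orders o
JOIN   customers c ON o.customer_id = c.id;

-- In the output, compare the Hash node's estimated rows to its actual rows and
-- check the "Buckets: ... Batches: ... Memory Usage: ..." line:
--   Batches: 1   -> the hash table fit in work_mem
--   Batches: > 1 -> the build side overflowed memory and spilled to disk
```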
For queries joining 5+ tables, cardinality errors often compound to 10,000× or more by the final join. This is why complex analytical queries sometimes perform orders of magnitude worse than expected. The only reliable solutions are: better statistics, query hints, or plan guides for critical queries.
Identifying and fixing cardinality estimation problems is a core database tuning skill. The primary diagnostic tool is comparing estimated versus actual row counts in execution plans.
```sql
-- PostgreSQL: Diagnosing cardinality estimation problems
-- ================================================================

-- STEP 1: Get estimated vs actual comparison
EXPLAIN (ANALYZE, VERBOSE, BUFFERS)
SELECT o.order_id, c.name, SUM(oi.quantity * oi.price)
FROM orders o
JOIN customers c    ON o.customer_id = c.id
JOIN order_items oi ON o.order_id = oi.order_id
WHERE o.order_date >= '2024-01-01'
  AND c.country = 'Germany'
GROUP BY o.order_id, c.name;

-- STEP 2: Look for large discrepancies in the output
/*
HashAggregate (cost=45678... rows=5000...) (actual time=567... rows=125000 loops=1)
                          ^^^^^^^^^                         ^^^^^^^^^^^
                          Estimated                          Actual        <- 25x UNDERESTIMATE

  -> Hash Join (cost=23456... rows=15000...) (actual time=234... rows=450000 loops=1)
                                                                            <- 30x UNDERESTIMATE
     ... deeper nodes ...

     -> Seq Scan on customers c (cost=... rows=8000...) (actual time=... rows=7850 loops=1)
                                                                            <- OK (1.02x)
*/

-- STEP 3: Identify the source of the error

-- Check statistics freshness
SELECT relname, n_live_tup, n_mod_since_analyze, last_analyze
FROM pg_stat_user_tables
WHERE relname IN ('orders', 'customers', 'order_items');

-- Check column statistics
SELECT attname,
       null_frac,
       n_distinct,
       array_length(most_common_vals::text::text[], 1) AS mcv_count
FROM pg_stats
WHERE tablename = 'orders'
  AND attname = 'order_date';

-- Check for correlation issues
SELECT s1.attname AS col1,
       s2.attname AS col2,
       'Potential correlation - consider extended stats' AS note
FROM pg_stats s1
CROSS JOIN pg_stats s2
WHERE s1.tablename = 'customers'
  AND s2.tablename = 'customers'
  AND s1.attname = 'country'
  AND s2.attname = 'city';

-- STEP 4: Test if extended statistics help
CREATE STATISTICS customers_country_region (dependencies, ndistinct)
ON country, region, city FROM customers;

ANALYZE customers;

-- Re-run EXPLAIN ANALYZE and compare estimates
```

When estimated rows differ from actual by more than 10×, there's likely a statistics or modeling problem worth investigating. Under 5× is usually acceptable; between 5× and 10× is worth monitoring; over 10× demands investigation and likely requires statistics improvement or query restructuring.
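To catch these discrepancies continuously rather than one query at a time, PostgreSQL's auto_explain contrib module can log plans with actual row counts for slow statements. A sketch of the configuration (the thresholds shown are illustrative, not recommendations):

```sql
-- Log the ANALYZEd plan of any statement slower than 5 seconds
-- (auto_explain ships with PostgreSQL as a contrib module)
LOAD 'auto_explain';
SET auto_explain.log_min_duration = '5s';  -- illustrative threshold
SET auto_explain.log_analyze = on;         -- include actual row counts
SET auto_explain.log_buffers = on;
SET auto_explain.log_timing = off;         -- reduce per-row timing overhead
-- Then scan the server log for nodes where estimated and actual rows diverge widely
```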
Traditional cardinality estimation based on histograms and independence assumptions has fundamental limitations. Modern database systems explore several advanced approaches:
```
=============================================================
ADAPTIVE QUERY EXECUTION (Conceptual Flow)
=============================================================

Traditional (Static) Planning:
1. Parse query
2. Estimate cardinalities using statistics
3. Generate optimal plan based on estimates
4. Execute plan
5. (Estimates may have been 100× wrong - too late!)

Adaptive Execution (e.g., Oracle, SQL Server, Databricks):
1. Parse query
2. Estimate cardinalities using statistics
3. Generate initial plan with "adaptation points"
4. Begin execution
5. At first adaptation point:
   - Compare actual cardinality to estimate
   - If within threshold: continue current plan
   - If different: reoptimize remaining operations
6. Continue with potentially adjusted plan
7. Log discrepancy for future statistics improvement

Example - Hash Join Adaptation:

Initial estimate: 10,000 rows from filter
Plan: Build hash table on 10K rows, probe with 1M rows

Execution begins:
- After scanning: 500,000 rows pass filter (50× more!)
- Adaptation triggered
- Replan: Switch build/probe sides, allocate more memory
- Continue with adjusted strategy

Result: Query completes in 10 seconds instead of failing

=============================================================
MACHINE LEARNING CARDINALITY ESTIMATION (Research Systems)
=============================================================

Training Phase:
- Execute thousands of queries with EXPLAIN ANALYZE
- Extract features: columns referenced, predicate types, table sizes
- Training labels: actual cardinalities from execution
- Train model (neural network, XGBoost, etc.)

Inference Phase:
- New query arrives
- Extract query features
- Feed to trained model
- Model outputs cardinality predictions
- Optimizer uses ML predictions instead of formula-based estimates

Advantages:
- Captures complex relationships histograms miss
- Learns from actual workload patterns
- Improves automatically as more queries execute

Challenges:
- Requires significant training data
- Model maintenance as data changes
- Explaining/debugging ML predictions is hard
- Cold start for new tables or query patterns
```

While ML-based estimation shows promising research results (often an order of magnitude better accuracy than traditional estimators), production adoption is limited. Most enterprises still rely on traditional statistics, with extended statistics for known problem areas. Adaptive execution is more widely deployed (Oracle 12c+, SQL Server 2017+, Spark 3.0+) because it's less invasive than replacing the entire estimation system.
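As a concrete example of the adaptive approach, Spark 3.0+ exposes adaptive query execution through session configuration. A minimal sketch in Spark SQL (the configuration keys are real; enabling them here is illustrative, and recent releases turn AQE on by default):

```sql
-- Spark SQL: enable adaptive query execution
SET spark.sql.adaptive.enabled = true;

-- Re-plan at runtime: split skewed join partitions and coalesce small
-- shuffle partitions based on actual, not estimated, sizes
SET spark.sql.adaptive.skewJoin.enabled = true;
SET spark.sql.adaptive.coalescePartitions.enabled = true;
```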
When you encounter cardinality estimation problems in production, here are practical remediation strategies:
```sql
-- PostgreSQL: Practical cardinality fixes
-- ================================================================

-- 1. Force ANALYZE with increased statistics targets
ALTER TABLE orders ALTER COLUMN customer_id SET STATISTICS 1000;
ALTER TABLE orders ALTER COLUMN order_date SET STATISTICS 500;
ANALYZE orders;

-- 2. Create extended statistics for correlated columns
CREATE STATISTICS orders_customer_date (dependencies, ndistinct, mcv)
ON customer_id, order_date FROM orders;
ANALYZE orders;

-- 3. Create statistics on computed expressions (PostgreSQL 14+)
CREATE STATISTICS orders_year_month (ndistinct)
ON (EXTRACT(YEAR FROM order_date)), (EXTRACT(MONTH FROM order_date))
FROM orders;
ANALYZE orders;

-- 4. Materialize a problematic subquery to get accurate stats
-- Instead of:
SELECT *
FROM (
    SELECT customer_id, COUNT(*) AS order_count
    FROM orders
    WHERE order_date >= '2024-01-01'
    GROUP BY customer_id
) subq
WHERE order_count > 100;

-- Create a materialized intermediate:
CREATE MATERIALIZED VIEW recent_customer_order_counts AS
SELECT customer_id, COUNT(*) AS order_count
FROM orders
WHERE order_date >= '2024-01-01'
GROUP BY customer_id;

ANALYZE recent_customer_order_counts;

-- Now query the materialized view
SELECT * FROM recent_customer_order_counts WHERE order_count > 100;

-- 5. Use a ROWS estimate for stubborn set-returning UDFs (PostgreSQL)
--    (the ROWS clause only applies to functions returning SETOF / TABLE)
CREATE FUNCTION get_customer_tier(customer_id INT)
RETURNS SETOF TEXT AS $$
BEGIN
    -- Complex logic here
    RETURN NEXT 'standard';
END;
$$ LANGUAGE plpgsql STABLE ROWS 1;
-- Tells the optimizer: this function returns approximately 1 row per call

-- 6. Consider the pg_hint_plan extension for row count overrides
-- (third-party extension; the Rows hint adjusts the estimate for a join)
/*
SET pg_hint_plan.enable_hint = on;

SELECT /*+ Rows(o oi #100) */ *
FROM orders o
JOIN order_items oi ON o.id = oi.order_id;
*/
```

Query hints that override cardinality estimates (like Oracle's CARDINALITY hint or PostgreSQL's pg_hint_plan) should be a last resort. They encode specific assumptions that become incorrect as data changes. Document why hints were added and schedule periodic review to ensure they remain valid.
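After creating extended statistics (items 2 and 3 above) and re-running ANALYZE, it's worth confirming the planner actually has something to work with. A sketch using the `pg_stats_ext` view (PostgreSQL 12+):

```sql
-- Verify that extended statistics were built and see what they captured
SELECT statistics_name,
       attnames,       -- columns covered
       kinds,          -- d = ndistinct, f = functional dependencies, m = MCV
       n_distinct,     -- multi-column distinct-count estimates
       dependencies    -- discovered functional-dependency strengths
FROM   pg_stats_ext
WHERE  tablename = 'orders';
```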
Cardinality estimation is the heart of query optimization. Let's consolidate the essential knowledge:

- Selectivity drives filter estimates: output cardinality = input cardinality × selectivity, with MCV lists, histograms, and n_distinct supplying the selectivity.
- AND/OR combinations assume statistical independence; correlated columns break that assumption and call for extended statistics.
- Equi-join cardinality ≈ |R| × |S| / MAX(n_distinct(R.key), n_distinct(S.key)); skew and filtered inputs routinely invalidate it.
- Errors multiply as they propagate through a plan; discrepancies beyond roughly 10× usually mean the optimizer chose the wrong plan.
- Diagnose by comparing estimated versus actual rows in EXPLAIN ANALYZE; fix with fresh statistics, higher statistics targets, extended statistics, materialized intermediates, and hints only as a last resort.
- Adaptive execution and ML-based estimation attack the residual gap at runtime or by learning from the workload.
Module Complete:
You've now completed the Cost Models module, developing comprehensive expertise in:

- I/O and CPU cost modeling
- statistics collection and maintenance
- cardinality estimation and its failure modes
This knowledge enables you to interpret query plans, diagnose performance problems, and tune database systems at a deep level. You understand not just what the optimizer does, but why it makes specific choices and how to influence those choices when needed.
Congratulations! You've mastered the theory and practice of cost modeling in database systems. From I/O and CPU costs through statistics and cardinality estimation, you now possess Principal Engineer-level understanding of how query optimizers make decisions. This knowledge is essential for building, tuning, and troubleshooting high-performance database applications.