If you could master only one aspect of query optimization, it should be cardinality estimation. Every decision the optimizer makes—join algorithms, access methods, parallelism, memory allocation—depends critically on predicting how many rows each operation will process and produce.
Cardinality estimation answers the question: "How many rows will this operation return?" For a simple filter, this means predicting how many rows satisfy the WHERE clause. For a join, it means predicting how many pairs will match the join condition. For a GROUP BY, it means predicting how many distinct groups will form.
When cardinality estimation is accurate, the optimizer makes brilliant choices. When it's wrong—and it often is—the results can be catastrophic. A 100× overestimate might cause allocation of vast amounts of memory that won't be used. A 100× underestimate might trigger a nested loop join that takes hours instead of the hash join that would take seconds.
This page develops deep expertise in cardinality estimation: how it works, why it fails, and how to fix it.
By the end of this page, you will understand how selectivity factors combine to estimate filter cardinality, join cardinality estimation formulas and their assumptions, the major sources of estimation error and how they compound, modern approaches including adaptive estimation and machine learning, and practical diagnosis and resolution of cardinality problems.
Selectivity is the fraction of rows that satisfy a predicate. It ranges from 0.0 (no rows match) to 1.0 (all rows match). Cardinality estimation for filters is:
Output cardinality = Input cardinality × Selectivity
For a table with 1,000,000 rows and a predicate with selectivity 0.05 (5%), the estimated output is 1,000,000 × 0.05 = 50,000 rows.
The challenge is determining selectivity accurately. Different predicate types use different estimation techniques:
| Predicate Type | Example | Estimation Method |
|---|---|---|
| Equality (indexed) | id = 42 | 1/n_distinct if uniform; MCV check first |
| Equality (MCV match) | status = 'active' | Lookup frequency in MCV list |
| Equality (non-MCV) | city = 'Smallville' | (1 - sum(mcv_freqs)) / (n_distinct - mcv_count) |
| NULL check | IS NULL / IS NOT NULL | null_frac or (1 - null_frac) |
| Range (numeric) | age BETWEEN 20 AND 30 | Histogram bucket interpolation |
| Range (open-ended) | price > 1000 | Sum histogram bucket fractions beyond bound |
| LIKE (prefix) | LIKE 'John%' | Histogram range on prefix bounds |
| LIKE (pattern) | LIKE '%@gmail%' | Default selectivity (~0.005 to 0.02) |
| IN list | IN (1, 2, 3) | Sum of individual equality selectivities |
| NOT condition | status != 'deleted' | 1 - selectivity(status = 'deleted') |
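These per-column inputs live in the statistics catalog. As a quick sketch (assuming the hypothetical `orders` table used throughout this page, and that it has been ANALYZEd), PostgreSQL exposes them through the `pg_stats` view:

```sql
-- Inspect the raw inputs the planner uses for the orders.status column
-- (hypothetical table; run ANALYZE first so statistics exist)
SELECT null_frac,            -- fraction of NULLs
       n_distinct,           -- distinct values (negative = fraction of row count)
       most_common_vals,     -- MCV list
       most_common_freqs,    -- matching MCV frequencies
       histogram_bounds      -- equal-height histogram over non-MCV values
FROM   pg_stats
WHERE  schemaname = 'public'
  AND  tablename  = 'orders'
  AND  attname    = 'status';
```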
```
=============================================================
SELECTIVITY CALCULATION WALKTHROUGH
=============================================================

Table: orders (1,000,000 rows)
Column: status (MCV list: {pending: 0.35, processing: 0.25, shipped: 0.20,
                           delivered: 0.15, cancelled: 0.05})

Case 1: status = 'pending' (MCV match)
  Selectivity = 0.35 (directly from MCV frequency)
  Estimated rows = 1,000,000 × 0.35 = 350,000

Case 2: status = 'archived' (not in MCV list)
  MCV coverage: 0.35 + 0.25 + 0.20 + 0.15 + 0.05 = 1.00 (all values!)
  No non-MCV values exist
  Selectivity = 0.0 (or a near-zero default)
  This correctly identifies 'archived' as non-existent

Case 3: status != 'cancelled'
  Selectivity = 1 - 0.05 = 0.95
  Estimated rows = 1,000,000 × 0.95 = 950,000

=============================================================
HISTOGRAM-BASED RANGE ESTIMATION
=============================================================

Column: total_amount
Histogram: 100 equal-height buckets spanning [0, 10000]
Each bucket represents 1% of the non-MCV rows
MCV values excluded: ~5% of rows are in MCVs

Query: WHERE total_amount BETWEEN 500 AND 2000

Lower bound 500 falls inside bucket 5  (~50% of that bucket qualifies)
Upper bound 2000 falls inside bucket 20 (~20% of that bucket qualifies)
Full buckets covered: 6, 7, ..., 19 = 14 buckets

Non-MCV selectivity: (14 + 0.5 + 0.2) × 0.01 = 0.147 (14.7%)
Apply to the non-MCV fraction: 0.147 × 0.95 ≈ 0.14
Add any MCVs in range: assume 2% of rows fall in this range
Final selectivity: 0.14 + 0.02 = 0.16

Estimated rows: 1,000,000 × 0.16 = 160,000
```

When the optimizer can't compute selectivity (complex expressions, missing statistics, UDF predicates), it falls back to defaults: roughly 0.33 for range predicates, 0.005 for equality, and 0.25 for inequality. These guesses are often wildly wrong, causing significant estimation errors for complex queries.
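The formulas from the walkthrough and the table above can be reproduced by hand from the catalog. A minimal sketch (assuming a hypothetical `orders.city` column with populated statistics) computes the non-MCV equality selectivity directly:

```sql
-- Non-MCV equality selectivity: (1 - sum(mcv_freqs)) / (n_distinct - mcv_count)
-- Hypothetical column orders.city; requires a prior ANALYZE
SELECT (1 - coalesce((SELECT sum(f) FROM unnest(most_common_freqs) AS f), 0))
       / NULLIF(n_distinct - coalesce(array_length(most_common_freqs, 1), 0), 0)
         AS non_mcv_equality_selectivity   -- NULL when MCVs cover every distinct value
FROM   pg_stats
WHERE  tablename = 'orders'
  AND  attname   = 'city'
  AND  n_distinct > 0;   -- formula as written assumes a positive distinct count
```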
Real queries have multiple predicates combined with AND, OR, and NOT. The optimizer must combine individual selectivities to estimate the overall filter selectivity.
```
=============================================================
COMBINING SELECTIVITIES (Assuming Independence)
=============================================================

Given predicates P1, P2 with selectivities s1, s2:

AND (Conjunction):
  Selectivity(P1 AND P2) = s1 × s2
  Example: status = 'active' AND country = 'USA'
    s1 = 0.4, s2 = 0.3
    Combined = 0.4 × 0.3 = 0.12 (12%)

OR (Disjunction):
  Selectivity(P1 OR P2) = s1 + s2 - (s1 × s2)
  Example: status = 'pending' OR status = 'processing'
    s1 = 0.35, s2 = 0.25
    Combined = 0.35 + 0.25 - (0.35 × 0.25) = 0.5125
    (These two values are mutually exclusive, so the true selectivity is 0.60;
     the independence formula underestimates here.)

NOT (Negation):
  Selectivity(NOT P1) = 1 - s1
  Example: NOT (status = 'deleted')
    s1 = 0.05
    Combined = 1 - 0.05 = 0.95

=============================================================
MULTI-PREDICATE EXAMPLE
=============================================================

Query: WHERE status = 'active' AND total > 100 AND created_date > '2024-01-01'

Individual selectivities (from statistics):
  status = 'active':            0.40
  total > 100:                  0.70
  created_date > '2024-01-01':  0.25

Combined (assuming independence): 0.40 × 0.70 × 0.25 = 0.07 (7%)

For 1,000,000 rows: Estimated output = 70,000 rows
```
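You can watch the independence assumption at work by comparing single-predicate and multi-predicate estimates. A sketch, assuming the same hypothetical `orders` table:

```sql
-- Estimated rows for each predicate alone, then combined
-- (compare the "rows=" figures in the three plan outputs)
EXPLAIN SELECT * FROM orders WHERE status = 'active';
EXPLAIN SELECT * FROM orders WHERE total > 100;
EXPLAIN SELECT * FROM orders WHERE status = 'active' AND total > 100;
-- Under independence, the third estimate ≈ (first estimate × second estimate) / total row count
```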
The Independence Assumption Problem:
The multiplication rule for AND predicates assumes columns are statistically independent: knowing the value of one column tells you nothing about another. This is often false:
Columns: state, city
Actual relationship: city determines state (functional dependency)
Predicate: state = 'California' AND city = 'Los Angeles'
Independence assumption:
P(state=CA) = 0.12, P(city=LA) = 0.04
Estimated: 0.12 × 0.04 = 0.0048 (0.48%)
Reality:
All 'Los Angeles' rows have state = 'California'
Actual: 0.04 (4%)
Underestimate: 8× error
This error cascades through joins and aggregations, potentially causing 100× or greater final estimation errors.
Mitigation Strategies:
Some optimizers impose minimum selectivity floors or dampen repeated multiplication so that combined AND selectivity can't collapse toward zero. However, these are heuristics, not solutions; the real fix is extended statistics on correlated columns, as sketched below.
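A minimal sketch of that fix in PostgreSQL (assuming a hypothetical `customers` table with `state` and `city` columns): functional-dependency statistics let the planner learn that `city` implies `state`.

```sql
-- Extended statistics capturing the city -> state functional dependency
-- (hypothetical customers table; PostgreSQL 10+)
CREATE STATISTICS customers_state_city (dependencies)
    ON state, city FROM customers;

ANALYZE customers;  -- populate the new statistics object

-- The combined estimate for state = 'California' AND city = 'Los Angeles'
-- should now track the city selectivity instead of the 8x-too-low product
EXPLAIN SELECT * FROM customers
WHERE state = 'California' AND city = 'Los Angeles';
```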
Join cardinality estimation is arguably the most challenging and impactful aspect of query optimization. The output of a join feeds into subsequent operations, so errors compound dramatically through the query plan.
The Basic Join Cardinality Formula:
|R ⋈ S| = |R| × |S| × Selectivity_join
For an equi-join on a common key:
Selectivity_join = 1 / MAX(n_distinct(R.key), n_distinct(S.key))
Rationale: Suppose R.key has 100 distinct values and S.key has 50 distinct values, and assume the 50 values in S all appear among the 100 values in R (the containment assumption). The |R| rows are spread over 100 distinct keys, so each key value appears in roughly |R|/100 rows of R. Each row of S therefore matches about |R|/100 rows of R, and summing over all |S| rows gives the expected output:

|R| × |S| / MAX(n_distinct(R.key), n_distinct(S.key))
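As a quick sketch (assuming the hypothetical `orders` and `customers` tables with fresh statistics), you can pull the inputs to this formula from `pg_stats` and compare your hand computation against the planner's own estimate:

```sql
-- n_distinct for both join columns (negative values mean a fraction of the row count)
SELECT tablename, attname, n_distinct
FROM   pg_stats
WHERE  (tablename, attname) IN (('orders', 'customer_id'), ('customers', 'id'));

-- Plug the values into the formula by hand:
--   estimated_join_rows = |orders| × |customers| / MAX(ndv_orders, ndv_customers)
-- then compare against the planner's estimate for the join:
EXPLAIN SELECT * FROM orders o JOIN customers c ON o.customer_id = c.id;
```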
```
=============================================================
JOIN CARDINALITY ESTIMATION EXAMPLES
=============================================================

=== CASE 1: Primary Key to Foreign Key Join ===
orders:      1,000,000 rows, order_id is PK  (n_distinct = 1,000,000)
order_items: 5,000,000 rows, order_id is FK  (n_distinct = 1,000,000)

|orders ⋈ order_items| = 1,000,000 × 5,000,000 / 1,000,000
                       = 5,000,000 rows

Interpretation: each order has ~5 items on average, correct!

=== CASE 2: Many-to-Many Join ===
products:   50,000 rows, category_id (n_distinct = 100)
categories: 100 rows, id is PK (n_distinct = 100)

|products ⋈ categories| = 50,000 × 100 / MAX(100, 100)
                        = 50,000 rows

Each product matches exactly one category, correct!

=== CASE 3: Filtered Join ===
SELECT * FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE c.country = 'USA'

Step 1: Apply the filter to customers
  customers: 100,000 rows × 0.30 (USA) = 30,000 rows

Step 2: Join filtered customers with orders
  orders: 1,000,000 rows, customer_id n_distinct = 80,000
  filtered customers: 30,000 rows, id n_distinct = 30,000

  |orders ⋈ filtered_customers| = 1,000,000 × 30,000 / MAX(80,000, 30,000)
                                = 1,000,000 × 30,000 / 80,000
                                = 375,000 rows

=== CASE 4: Problematic Data Skew ===
orders: 1,000,000 rows, customer_id (n_distinct = 80,000)
Statistics suggest a uniform distribution, but:
  the top 100 customers account for 50% of orders!

Query: JOIN for one top customer with 50,000 orders
Estimated: 1,000,000 / 80,000 = 12.5 orders per customer
Actual:    50,000 orders for this customer

Error: 4,000× underestimate for this specific join!
```

When the optimizer knows about foreign key relationships (through constraints), it can produce tighter cardinality bounds. A FK from order_items.order_id to orders.id guarantees the join produces at most |order_items| rows. Modern optimizers exploit these semantic constraints when available.
Cardinality estimation errors don't stay contained—they propagate and amplify through the query plan. Understanding this propagation is essential for diagnosing complex performance problems.
The Multiplicative Error Effect:
Error in A: 10×
Error in B: 10×
Error in A⋈B: Can be 10 × 10 = 100× or more!
Why can the combined error exceed even that product?
- Join selectivity also estimated incorrectly
- Correlation between join columns not captured
- Skewed distributions amplify effects
Real-World Impact:
| Estimation Error | Typical Consequence |
|---|---|
| 2-5× | Usually acceptable; optimizer makes reasonable choices |
| 5-10× | May cause suboptimal algorithm selection; 2-3× slowdown |
| 10-100× | Likely to cause wrong join algorithm; 10× or worse slowdown |
| 100-1000× | Almost certainly wrong plan; query may be unusable |
| 1000×+ | Catastrophic; timeouts, resource exhaustion, system impact |
The Optimizer's Dilemma:
The optimizer must commit to a plan before execution. If it estimates 1,000 rows and allocates memory for a hash table holding 1,000 rows, but 1,000,000 rows actually flow through, the hash table overflows its memory budget and spills to disk, and a plan built around that "small" input degrades by orders of magnitude. In a traditional optimizer there is no opportunity to revisit the choice mid-query.
This is why cardinality accuracy matters more than cost model precision.
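One concrete way to spot this after the fact (a sketch, assuming PostgreSQL and the hypothetical tables used above): a hash join that was sized for far fewer rows than it received shows up as multiple batches in EXPLAIN ANALYZE output.

```sql
-- Look for hash joins whose build side was underestimated and spilled
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM   orders o
JOIN   customers c ON o.customer_id = c.id;

-- In the output, compare the Hash node's estimated rows to its actual rows and
-- check the "Buckets: ... Batches: ... Memory Usage: ..." line:
--   Batches: 1   -> the hash table fit in work_mem
--   Batches: > 1 -> the build side overflowed memory and spilled to disk
```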
For queries joining 5+ tables, cardinality errors often compound to 10,000× or more by the final join. This is why complex analytical queries sometimes perform orders of magnitude worse than expected. The only reliable solutions are: better statistics, query hints, or plan guides for critical queries.
Identifying and fixing cardinality estimation problems is a core database tuning skill. The primary diagnostic tool is comparing estimated versus actual row counts in execution plans.
```sql
-- PostgreSQL: Diagnosing cardinality estimation problems
-- ================================================================

-- STEP 1: Get estimated vs actual comparison
EXPLAIN (ANALYZE, VERBOSE, BUFFERS)
SELECT o.order_id, c.name, SUM(oi.quantity * oi.price)
FROM orders o
JOIN customers c    ON o.customer_id = c.id
JOIN order_items oi ON o.order_id = oi.order_id
WHERE o.order_date >= '2024-01-01'
  AND c.country = 'Germany'
GROUP BY o.order_id, c.name;

-- STEP 2: Look for large discrepancies in the output
/*
HashAggregate (cost=45678... rows=5000...) (actual time=567... rows=125000 loops=1)
                          ^^^^^^^^^                         ^^^^^^^^^^^
                          Estimated                          Actual        <- 25x UNDERESTIMATE

  -> Hash Join (cost=23456... rows=15000...) (actual time=234... rows=450000 loops=1)
                                                                            <- 30x UNDERESTIMATE
     ... deeper nodes ...

     -> Seq Scan on customers c (cost=... rows=8000...) (actual time=... rows=7850 loops=1)
                                                                            <- OK (1.02x)
*/

-- STEP 3: Identify the source of the error

-- Check statistics freshness
SELECT relname, n_live_tup, n_mod_since_analyze, last_analyze
FROM pg_stat_user_tables
WHERE relname IN ('orders', 'customers', 'order_items');

-- Check column statistics
SELECT attname,
       null_frac,
       n_distinct,
       array_length(most_common_vals::text::text[], 1) AS mcv_count
FROM pg_stats
WHERE tablename = 'orders'
  AND attname = 'order_date';

-- Check for correlation issues
SELECT s1.attname AS col1,
       s2.attname AS col2,
       'Potential correlation - consider extended stats' AS note
FROM pg_stats s1
CROSS JOIN pg_stats s2
WHERE s1.tablename = 'customers'
  AND s2.tablename = 'customers'
  AND s1.attname = 'country'
  AND s2.attname = 'city';

-- STEP 4: Test if extended statistics help
CREATE STATISTICS customers_country_region (dependencies, ndistinct)
ON country, region, city FROM customers;

ANALYZE customers;

-- Re-run EXPLAIN ANALYZE and compare estimates
```

When estimated rows differ from actual by more than 10×, there's likely a statistics or modeling problem worth investigating. Under 5× is usually acceptable; between 5× and 10× is worth monitoring; over 10× demands investigation and likely requires statistics improvement or query restructuring.
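To catch these discrepancies continuously rather than one query at a time, PostgreSQL's auto_explain contrib module can log plans with actual row counts for slow statements. A sketch of the configuration (the thresholds shown are illustrative, not recommendations):

```sql
-- Log the ANALYZEd plan of any statement slower than 5 seconds
-- (auto_explain ships with PostgreSQL as a contrib module)
LOAD 'auto_explain';
SET auto_explain.log_min_duration = '5s';  -- illustrative threshold
SET auto_explain.log_analyze = on;         -- include actual row counts
SET auto_explain.log_buffers = on;
SET auto_explain.log_timing = off;         -- reduce per-row timing overhead
-- Then scan the server log for nodes where estimated and actual rows diverge widely
```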
Traditional cardinality estimation based on histograms and independence assumptions has fundamental limitations. Modern database systems explore several advanced approaches:
```
=============================================================
ADAPTIVE QUERY EXECUTION (Conceptual Flow)
=============================================================

Traditional (Static) Planning:
1. Parse query
2. Estimate cardinalities using statistics
3. Generate optimal plan based on estimates
4. Execute plan
5. (Estimates may have been 100× wrong - too late!)

Adaptive Execution (e.g., Oracle, SQL Server, Databricks):
1. Parse query
2. Estimate cardinalities using statistics
3. Generate initial plan with "adaptation points"
4. Begin execution
5. At first adaptation point:
   - Compare actual cardinality to estimate
   - If within threshold: continue current plan
   - If different: reoptimize remaining operations
6. Continue with potentially adjusted plan
7. Log discrepancy for future statistics improvement

Example - Hash Join Adaptation:

Initial estimate: 10,000 rows from filter
Plan: Build hash table on 10K rows, probe with 1M rows

Execution begins:
- After scanning: 500,000 rows pass filter (50× more!)
- Adaptation triggered
- Replan: Switch build/probe sides, allocate more memory
- Continue with adjusted strategy

Result: Query completes in 10 seconds instead of failing

=============================================================
MACHINE LEARNING CARDINALITY ESTIMATION (Research Systems)
=============================================================

Training Phase:
- Execute thousands of queries with EXPLAIN ANALYZE
- Extract features: columns referenced, predicate types, table sizes
- Training labels: actual cardinalities from execution
- Train model (neural network, XGBoost, etc.)

Inference Phase:
- New query arrives
- Extract query features
- Feed to trained model
- Model outputs cardinality predictions
- Optimizer uses ML predictions instead of formula-based estimates

Advantages:
- Captures complex relationships histograms miss
- Learns from actual workload patterns
- Improves automatically as more queries execute

Challenges:
- Requires significant training data
- Model maintenance as data changes
- Explaining/debugging ML predictions is hard
- Cold start for new tables or query patterns
```

While ML-based estimation shows promising research results (often an order of magnitude better accuracy than traditional estimators), production adoption is limited. Most enterprises still rely on traditional statistics, with extended statistics for known problem areas. Adaptive execution is more widely deployed (Oracle 12c+, SQL Server 2017+, Spark 3.0+) because it's less invasive than replacing the entire estimation system.
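As a concrete example of the adaptive approach, Spark 3.0+ exposes adaptive query execution through session configuration. A minimal sketch in Spark SQL (the configuration keys are real; enabling them here is illustrative, and recent releases turn AQE on by default):

```sql
-- Spark SQL: enable adaptive query execution
SET spark.sql.adaptive.enabled = true;

-- Re-plan at runtime: split skewed join partitions and coalesce small
-- shuffle partitions based on actual, not estimated, sizes
SET spark.sql.adaptive.skewJoin.enabled = true;
SET spark.sql.adaptive.coalescePartitions.enabled = true;
```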
When you encounter cardinality estimation problems in production, here are practical remediation strategies:
```sql
-- PostgreSQL: Practical cardinality fixes
-- ================================================================

-- 1. Force ANALYZE with increased statistics targets
ALTER TABLE orders ALTER COLUMN customer_id SET STATISTICS 1000;
ALTER TABLE orders ALTER COLUMN order_date SET STATISTICS 500;
ANALYZE orders;

-- 2. Create extended statistics for correlated columns
CREATE STATISTICS orders_customer_date (dependencies, ndistinct, mcv)
ON customer_id, order_date FROM orders;
ANALYZE orders;

-- 3. Create statistics on computed expressions (PostgreSQL 14+)
CREATE STATISTICS orders_year_month (ndistinct)
ON (EXTRACT(YEAR FROM order_date)), (EXTRACT(MONTH FROM order_date))
FROM orders;
ANALYZE orders;

-- 4. Materialize a problematic subquery to get accurate stats
-- Instead of:
SELECT *
FROM (
    SELECT customer_id, COUNT(*) AS order_count
    FROM orders
    WHERE order_date >= '2024-01-01'
    GROUP BY customer_id
) subq
WHERE order_count > 100;

-- Create a materialized intermediate:
CREATE MATERIALIZED VIEW recent_customer_order_counts AS
SELECT customer_id, COUNT(*) AS order_count
FROM orders
WHERE order_date >= '2024-01-01'
GROUP BY customer_id;

ANALYZE recent_customer_order_counts;

-- Now query the materialized view
SELECT * FROM recent_customer_order_counts WHERE order_count > 100;

-- 5. Use a ROWS estimate for stubborn set-returning UDFs (PostgreSQL)
--    (the ROWS clause only applies to functions returning SETOF / TABLE)
CREATE FUNCTION get_customer_tier(customer_id INT)
RETURNS SETOF TEXT AS $$
BEGIN
    -- Complex logic here
    RETURN NEXT 'standard';
END;
$$ LANGUAGE plpgsql STABLE ROWS 1;
-- Tells the optimizer: this function returns approximately 1 row per call

-- 6. Consider the pg_hint_plan extension for row count overrides
-- (third-party extension; the Rows hint adjusts the estimate for a join)
/*
SET pg_hint_plan.enable_hint = on;

SELECT /*+ Rows(o oi #100) */ *
FROM orders o
JOIN order_items oi ON o.id = oi.order_id;
*/
```

Query hints that override cardinality estimates (like Oracle's CARDINALITY hint or PostgreSQL's pg_hint_plan) should be a last resort. They encode specific assumptions that become incorrect as data changes. Document why hints were added and schedule periodic review to ensure they remain valid.
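After creating extended statistics (items 2 and 3 above) and re-running ANALYZE, it's worth confirming the planner actually has something to work with. A sketch using the `pg_stats_ext` view (PostgreSQL 12+):

```sql
-- Verify that extended statistics were built and see what they captured
SELECT statistics_name,
       attnames,       -- columns covered
       kinds,          -- d = ndistinct, f = functional dependencies, m = MCV
       n_distinct,     -- multi-column distinct-count estimates
       dependencies    -- discovered functional-dependency strengths
FROM   pg_stats_ext
WHERE  tablename = 'orders';
```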
Cardinality estimation is the heart of query optimization. Let's consolidate the essential knowledge:

- Selectivity drives filter estimates: output cardinality = input cardinality × selectivity, with MCV lists, histograms, and n_distinct supplying the selectivity.
- AND/OR combinations assume statistical independence; correlated columns break that assumption and call for extended statistics.
- Equi-join cardinality ≈ |R| × |S| / MAX(n_distinct(R.key), n_distinct(S.key)); skew and filtered inputs routinely invalidate it.
- Errors multiply as they propagate through a plan; discrepancies beyond roughly 10× usually mean the optimizer chose the wrong plan.
- Diagnose by comparing estimated versus actual rows in EXPLAIN ANALYZE; fix with fresh statistics, higher statistics targets, extended statistics, materialized intermediates, and hints only as a last resort.
- Adaptive execution and ML-based estimation attack the residual gap at runtime or by learning from the workload.
Module Complete:
You've now completed the Cost Models module, developing comprehensive expertise in:

- I/O and CPU cost modeling
- statistics collection and maintenance
- cardinality estimation and its failure modes
This knowledge enables you to interpret query plans, diagnose performance problems, and tune database systems at a deep level. You understand not just what the optimizer does, but why it makes specific choices and how to influence those choices when needed.
Congratulations! You've mastered the theory and practice of cost modeling in database systems. From I/O and CPU costs through statistics and cardinality estimation, you now possess Principal Engineer-level understanding of how query optimizers make decisions. This knowledge is essential for building, tuning, and troubleshooting high-performance database applications.