All the statistics we've collected—table sizes, column distributions, histograms, correlations—converge on a single critical output: cardinality estimation. This is the optimizer's prediction of how many rows will flow through each node of the query plan.
Get the cardinality right, and the optimizer makes brilliant decisions. Get it wrong, and even a sophisticated optimizer produces disastrous plans. A query expected to return 100 rows might actually return 1 million. The optimizer chose a nested loop join (perfect for 100 rows, catastrophic for 1 million), and the query runs 10,000x slower than optimal.
By the end of this page, you will understand how cardinality estimation works, the sources of estimation error, how errors compound through join operations, techniques for diagnosing and mitigating estimation problems, and the fundamental limits of row estimation.
Cardinality in query optimization means the number of rows produced by an operator or flowing through a node in the query plan. Cardinality estimation is the process of predicting these counts without executing the query.
How Cardinality Estimation Works:
Input: Base table statistics + Query predicates
Output: Estimated row count at each plan node
Process:
1. Start with base table cardinality (from statistics)
2. Apply selectivity factors for each predicate:
estimated_rows = base_rows × selectivity
3. Propagate estimates up through the plan tree
4. Use join selectivity formulas for joining tables
Example:
SELECT * FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
WHERE o.status = 'pending'
AND o.order_date > '2024-01-01'
AND c.country = 'US';
Base tables:
orders: 10,000,000 rows
customers: 1,000,000 rows
Estimation steps:
1. orders after status filter:
sel(status='pending') = 0.20 (from MCV)
10,000,000 × 0.20 = 2,000,000 rows
2. orders after date filter:
sel(date > '2024-01-01') = 0.30 (from histogram)
Combined: 2,000,000 × 0.30 = 600,000 rows
(assumes independence)
3. customers after country filter:
sel(country='US') = 0.40 (from MCV)
1,000,000 × 0.40 = 400,000 rows
4. Join cardinality:
join_sel = 1 / max(NDV(o.customer_id), NDV(c.customer_id))
join_sel = 1 / 1,000,000 = 0.000001
600,000 × 400,000 × 0.000001 = 240,000 result rows
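To see where these numbers come from on a live PostgreSQL system, the per-column NDV that drives the join-selectivity formula is exposed in pg_stats; the sketch below assumes the orders and customers tables above (a negative n_distinct encodes a fraction of the row count rather than an absolute count), and the second query is just the step-4 arithmetic as a runnable check.

SELECT tablename, attname, n_distinct
FROM pg_stats
WHERE (tablename, attname) IN (('orders', 'customer_id'),
                               ('customers', 'customer_id'));

SELECT 600000::numeric * 400000 * (1.0 / 1000000) AS estimated_join_rows;  -- 240,000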
| Operator | Cardinality Formula | Key Dependencies |
|---|---|---|
| Table Scan | table_rows (from statistics) | Accurate row count |
| Selection (σ) | input_rows × selectivity | Predicate selectivity |
| Projection (π) | input_rows (unchanged) | None |
| Inner Join (⋈) | (rows_A × rows_B) × join_selectivity | Join key NDV |
| Left Outer Join | rows_A + (rows_A × rows_B × sel) - matched | Match rate estimation |
| GROUP BY | NDV of grouping columns | Multi-column NDV |
| DISTINCT | NDV of result columns | Multi-column NDV |
| UNION | rows_A + rows_B (may adjust for overlap) | Optional overlap estimation |
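As a concrete instance of the GROUP BY row above, you can compare the optimizer's aggregate estimate with the stored NDV directly; a small sketch, assuming PostgreSQL and the customers table from the earlier example.

-- The estimated row count on the HashAggregate/GroupAggregate node
-- should track the distinct-value estimate for the grouping column.
EXPLAIN SELECT country, COUNT(*) FROM customers GROUP BY country;

SELECT n_distinct
FROM pg_stats
WHERE tablename = 'customers' AND attname = 'country';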
Every cost calculation depends on cardinality estimates. Join order, access method selection, memory allocation, parallelization decisions—all flow from cardinality. A 10x cardinality error can cause a 100x or 1000x performance error through cascading decision impacts.
Cardinality estimation is fundamentally approximate. Many sources of error are inherent to the statistical approach.
Common sources of error include:
- The independence assumption: the optimizer estimates P(A AND B) = P(A) × P(B). Correlated columns violate this assumption dramatically.
- Function-wrapped columns: WHERE YEAR(date_col) = 2024 can't use date_col statistics, so the optimizer often assumes a default 1% selectivity.
- Opaque predicates: WHERE col LIKE '%pattern%' or user-defined functions have unknown selectivity.

The Independence Problem in Detail:
-- Query with correlated columns
SELECT * FROM employees
WHERE department = 'Engineering'
AND skill = 'Python';
-- True correlation: 90% of Python developers are in Engineering
Optimizer's calculation (independence):
P(Engineering) = 0.30
P(Python) = 0.20
P(Engineering AND Python) = 0.30 × 0.20 = 0.06 (6%)
Actual probability: 0.18 (18%) — 3x underestimate!
-- The optimizer underestimates, potentially choosing:
-- - Index scan instead of table scan
-- - Nested loop instead of hash join
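One way to confirm this kind of correlation before blaming the optimizer is to measure it directly; a quick sketch against the hypothetical employees table, comparing the observed combined selectivity with the product the independence assumption would use.

SELECT
    AVG((department = 'Engineering')::int)                       AS sel_department,
    AVG((skill = 'Python')::int)                                 AS sel_skill,
    AVG((department = 'Engineering' AND skill = 'Python')::int)  AS sel_actual,
    AVG((department = 'Engineering')::int)
      * AVG((skill = 'Python')::int)                             AS sel_independence
FROM employees;
-- sel_actual much larger than sel_independence is the signature of correlated columns.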
When a predicate wraps a column in a function, statistics become useless. WHERE UPPER(name) = 'JOHN' doesn't match the 'name' histogram. Many optimizers default to magic numbers like 1%, 5%, or 10% selectivity. These guesses are correct only by accident.
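You can watch the fallback happen with two EXPLAINs; a sketch against the same hypothetical employees table (the exact default percentage varies by engine and version).

EXPLAIN SELECT * FROM employees WHERE name = 'John';          -- estimate driven by MCVs/histogram on name
EXPLAIN SELECT * FROM employees WHERE UPPER(name) = 'JOHN';   -- estimate falls back to a generic default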
Individual estimation errors are problematic; error propagation through joins is catastrophic. Join cardinality estimation multiplies errors, often exponentially.
The Join Explosion:
Consider a 4-way join:
A ⋈ B ⋈ C ⋈ D
True cardinalities:
|A| = 100,000
|A ⋈ B| = 10,000
|A ⋈ B ⋈ C| = 5,000
|A ⋈ B ⋈ C ⋈ D| = 1,000
With 2x overestimate at each join:
|A ⋈ B|_est = 20,000 (2x off)
|A ⋈ B ⋈ C|_est = 20,000 (4x off)
|A ⋈ B ⋈ C ⋈ D|_est = 8,000 (8x off)
With 3x overestimate at each join:
|A ⋈ B|_est = 30,000 (3x off)
|A ⋈ B ⋈ C|_est = 45,000 (9x off)
|A ⋈ B ⋈ C ⋈ D|_est = 27,000 (27x off!)
With n joins and an average per-join error factor of e, the final cardinality error can grow like e^n (see the table below and the quick check after it). A 2-table query with a 3x error stays 3x wrong; a 6-table query, with five joins, can end up 243x wrong.
| Joins | 2x Error/Join | 3x Error/Join | 5x Error/Join |
|---|---|---|---|
| 1 | 2x | 3x | 5x |
| 2 | 4x | 9x | 25x |
| 3 | 8x | 27x | 125x |
| 4 | 16x | 81x | 625x |
| 5 | 32x | 243x | 3,125x |
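The table is simply the error factor raised to the number of joins; a quick way to reproduce it, assuming PostgreSQL for generate_series and the ^ operator.

SELECT n AS joins,
       (2.0 ^ n)::int AS err_2x,
       (3.0 ^ n)::int AS err_3x,
       (5.0 ^ n)::int AS err_5x
FROM generate_series(1, 5) AS n;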
Why Join Order Matters More Than You Think:
Given severe underestimate for A ⋈ B:
True: 1,000,000 rows
Estimated: 1,000 rows
Optimizer's plan (based on estimate):
Build tiny hash table for A ⋈ B (1,000 rows fits in 64KB)
Stream C through it
What actually happens:
Hash table needs 1,000,000 rows → 64MB
Exceeds memory → spills to disk
Performance: 100x slower than estimated
Better plan (if estimate was correct):
Build hash on C (smaller table)
Stream A ⋈ B through it
Much better memory utilization
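You can reproduce this spill behavior on a small scale in PostgreSQL; a sketch with hypothetical tables and a deliberately tiny work_mem.

SET work_mem = '64kB';            -- force even a modest hash table to spill
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM fact f                        -- hypothetical tables
JOIN dim d ON d.id = f.dim_id;
-- On the Hash node, "Batches: N" with N > 1 means the build side did not
-- fit in work_mem and spilled to disk.
RESET work_mem;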
The optimizer chooses plans based on estimates it believes to be correct. When estimates are wrong, it's not making 'bad' decisions—it's making optimal decisions for the wrong problem. This is why cardinality accuracy is existentially important.
When queries perform poorly, cardinality misestimation is often the culprit. Here's how to diagnose it systematically.
-- Step 1: Get estimated vs. actual cardinalities
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT * FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
WHERE o.status = 'pending';

-- Look for discrepancies like:
-- Hash Join (cost=... rows=1000) (actual rows=150000)
--                         ↑ This is the problem! 150x underestimate

-- Step 2: Examine per-operator estimates
EXPLAIN (ANALYZE, VERBOSE, FORMAT JSON)  -- JSON for programmatic analysis
SELECT ...;

-- Step 3: Check statistics on suspicious columns
SELECT attname, n_distinct, correlation,
       most_common_vals, most_common_freqs, histogram_bounds
FROM pg_stats
WHERE tablename = 'orders' AND attname = 'status';

-- Step 4: Test selectivity in isolation
EXPLAIN SELECT * FROM orders WHERE status = 'pending';
-- Compare estimated rows to:
SELECT COUNT(*) FROM orders WHERE status = 'pending';

-- Step 5: Check for missing extended statistics
SELECT * FROM pg_statistic_ext WHERE stxrelid = 'orders'::regclass;

-- Step 6: Create extended statistics if correlation suspected
CREATE STATISTICS ext_orders_status_date (dependencies)
ON status, order_date FROM orders;
ANALYZE orders;
-- Re-run EXPLAIN ANALYZE to see if the estimate improved

If actual rows differ from estimates by more than 10x at any operator, you have a cardinality problem worth investigating. Errors of less than 10x rarely affect plan quality significantly due to cost model robustness.
Once you've diagnosed estimation problems, several mitigation strategies can improve accuracy.
1. Refresh Stale Statistics
-- The first thing to try: fresh ANALYZE
ANALYZE affected_table;
-- With higher precision if needed
ALTER TABLE orders ALTER COLUMN status SET STATISTICS 1000;
ANALYZE orders;
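Before re-analyzing, it is worth checking how stale the statistics actually are; a PostgreSQL sketch, with the table name assumed.

SELECT relname,
       n_live_tup,
       n_mod_since_analyze,   -- rows modified since the last ANALYZE
       last_analyze,
       last_autoanalyze
FROM pg_stat_user_tables
WHERE relname = 'orders';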
2. Create Extended/Multi-Column Statistics
-- PostgreSQL: Correlation-aware statistics
CREATE STATISTICS ext_status_date (dependencies)
ON status, order_date FROM orders;
-- SQL Server: Multi-column statistics
CREATE STATISTICS stat_status_date
ON orders (status, order_date)
WITH FULLSCAN;
-- Oracle: Column groups
SELECT DBMS_STATS.CREATE_EXTENDED_STATS(
'SALES', 'ORDERS', '(STATUS, ORDER_DATE)'
) FROM DUAL;
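In PostgreSQL you can confirm that the dependency statistics were actually built, and how strong the detected dependencies are, via the pg_stats_ext view (available from version 12).

SELECT statistics_name, attnames, dependencies
FROM pg_stats_ext
WHERE tablename = 'orders';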
3. Create Expression Statistics
-- PostgreSQL 14+: Statistics on expressions
CREATE STATISTICS expr_year
ON (EXTRACT(YEAR FROM order_date))
FROM orders;
ANALYZE orders;
-- Now WHERE EXTRACT(YEAR FROM order_date) = 2024
-- uses proper histogram instead of default 1%
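On versions before PostgreSQL 14, an expression index gives a similar effect, because ANALYZE also gathers statistics on indexed expressions; a sketch, assuming order_date is a plain date or timestamp column (indexed expressions must be immutable).

CREATE INDEX idx_orders_order_year
    ON orders ((EXTRACT(YEAR FROM order_date)));
ANALYZE orders;
-- ANALYZE now collects statistics for the indexed expression, so the
-- EXTRACT(YEAR FROM order_date) = 2024 predicate gets a real estimate.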
Cardinality hints and forced plans should be used sparingly. They create maintenance burden, don't adapt to data changes, and mask underlying problems. Always prefer fixing statistics first.
Modern databases incorporate adaptive execution features that can correct cardinality errors during query execution, not just during planning.
SQL Server Adaptive Query Processing (2017+):
1. Memory Grant Feedback
- Hash joins that spill to tempdb remember the spill
- Next execution gets more memory automatically
- Gradually converges to optimal memory grant
2. Adaptive Joins
- Plan includes BOTH nested loop and hash join choices
- At runtime, chooses based on actual row count
- If <threshold rows: nested loop; else: hash join
3. Interleaved Execution
- Multi-statement table-valued functions
- Execute TVF first, use actual cardinality for rest of plan
Oracle Adaptive Plans (12c+):
1. Adaptive Plan Selection
- Plan compiled with "statistics collector" operators
- During first execution, actual cardinality is observed
- Plan switches sub-plans based on observed cardinality
2. SQL Plan Directives
- When estimates are wrong, directive is created
- Next execution uses dynamic sampling for that table/predicate
- Automatically improves estimates without manual intervention
3. Auto-Created Extended Statistics
- Optimizer tracks which column groups have correlated estimates
- Automatically creates column group statistics on problematic groups
PostgreSQL Adaptive Mechanisms:
PostgreSQL takes a more conservative approach:
1. No runtime plan switching
- Plan is fixed at compile time
- Philosophy: predictable execution over adaptive optimization
2. JIT Compilation Adapts to Row Count
- Large result sets trigger JIT compilation
- Based on actual execution, not estimates
3. Parallel Query Adaptation
- Workers can terminate early if estimates were wrong
- But initial worker count is based on estimates
General guidance for PostgreSQL:
- Focus on statistics accuracy upfront
- Use extended statistics for correlations
- Consider pg_hint_plan extension for problem queries
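If you do reach for pg_hint_plan, its Rows hint corrects a join estimate directly; a sketch, assuming the extension is installed and loaded, using the figures from the earlier example.

/*+ Rows(o c #240000) */
SELECT *
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
WHERE o.status = 'pending'
  AND o.order_date > '2024-01-01'
  AND c.country = 'US';
-- "#240000" pins the o/c join estimate to 240,000 rows; the *, + and -
-- forms adjust the planner's own estimate instead of replacing it.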
Adaptive query execution represents a paradigm shift: acknowledging that estimates will be wrong and building plans that can recover. New databases and versions increasingly incorporate these features. Stay current with your database's adaptive capabilities.
Despite our best efforts, cardinality estimation has fundamental theoretical limits that cannot be overcome with more statistics or better algorithms.
The Parameter Sensitivity Problem:
-- Same query, vastly different selectivity
SELECT * FROM orders WHERE status = @status;
@status = 'pending' → 40% of table (4M rows)
@status = 'archived' → 0.01% of table (1,000 rows)
Optimal plan for 'pending': Sequential scan + hash join
Optimal plan for 'archived': Index scan + nested loop join
The optimizer compiles ONE plan. Which parameter does it optimize for?
Solutions:
1. Parameter sniffing: Use first executed value (SQL Server, Oracle)
→ Works great until a different value runs and reuses the plan
2. Average selectivity: Estimate for 'typical' value
→ Often wrong for both extremes
3. Multiple plans: Compile different plans for different parameter ranges
→ Few databases support this well
4. OPTION (RECOMPILE): Compile fresh plan each time
→ Works but adds compilation overhead
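PostgreSQL's closest equivalent to option 4 for prepared statements is to force custom plans, so every execution is planned with the actual parameter value (plan_cache_mode is available from version 12).

SET plan_cache_mode = force_custom_plan;

PREPARE orders_by_status(text) AS
    SELECT * FROM orders WHERE status = $1;

EXPLAIN EXECUTE orders_by_status('archived');  -- planned for the rare value
EXPLAIN EXECUTE orders_by_status('pending');   -- planned for the common value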
Perfect estimation is impossible. The goal is 'good enough' estimation—close enough that the optimizer picks reasonable plans. A plan within 2x of optimal is usually acceptable. Only catastrophic misestimates (10x+) require intervention.
Cardinality estimation is where statistics meet decision-making. Every optimization choice—join order, access method, memory allocation—depends on the optimizer's row count predictions. Accurate estimation enables brilliant query plans; inaccurate estimation causes disasters.
Module Complete:
You've now explored the complete statistical foundation of query optimization: table-level statistics, column distributions and histograms, correlation handling, and cardinality estimation.
This knowledge forms the diagnostic foundation for query performance troubleshooting. When a query is slow, you now know where to look first: Are the stats fresh? Are the estimates accurate? What's causing the discrepancy?
Congratulations! You've mastered Statistics and Histograms—the empirical foundation of cost-based query optimization. You can now diagnose cardinality problems, maintain statistics effectively, and understand both the power and limitations of estimation. This knowledge directly translates to solving real-world query performance problems.