The logical query plan tells us what to compute. But there are countless ways to compute it. A simple three-table join can be executed in dozens of different ways—different join orders, different join algorithms, different access methods for each table. The difference between the best and worst plan can be orders of magnitude: seconds versus hours.
Query optimization is the process of finding an efficient execution strategy from the space of equivalent alternatives. It's one of the most intellectually challenging and practically important components of database systems—the reason modern databases can efficiently process queries against terabyte-scale datasets.
This is where database systems truly shine. A declarative query becomes an optimized execution plan, transforming user intent into machine efficiency.
By the end of this page, you will understand the query optimizer's goals and constraints, heuristic and rule-based optimization techniques, cost-based optimization principles, how statistics and cardinality estimation drive plan selection, and the exponential search space that makes optimization challenging.
To appreciate query optimization, consider how dramatically execution strategies can differ in performance.
Example: Simple Join Performance Variance
Consider joining two tables:
- customers — 100,000 rows
- orders — 10,000,000 rows (100 orders per customer)

```sql
SELECT c.name, COUNT(*)
FROM customers c JOIN orders o ON c.id = o.customer_id
GROUP BY c.name;
```
| Strategy | Approach | Operations | Estimated Time |
|---|---|---|---|
| Nested Loop (bad order) | For each order, scan customers | 10M × 100K = 1 trillion comparisons | Hours |
| Nested Loop (good order) | For each customer, index lookup orders | 100K × log(10M) = 2.4M operations | Seconds |
| Hash Join | Build hash on customers, probe with orders | 100K + 10M = 10.1M operations | Seconds |
| Merge Join (presorted) | Merge sorted streams | 100K + 10M = 10.1M operations | Seconds |
The performance gap can be 10,000× or more. A few seconds versus many hours. The optimizer's job is finding those efficient strategies automatically.
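To see which strategy a real optimizer picks, you can simply ask it. The sketch below shows the kind of plan PostgreSQL might produce for this query; the plan shape, costs, and row counts are illustrative only and depend entirely on your data and statistics:

```sql
EXPLAIN
SELECT c.name, COUNT(*)
FROM customers c JOIN orders o ON c.id = o.customer_id
GROUP BY c.name;

-- Illustrative output: the optimizer chose a hash join, building the
-- hash table on the smaller side (customers) and probing with orders.
--
-- HashAggregate  (cost=... rows=100000 ...)
--   Group Key: c.name
--   ->  Hash Join  (cost=... rows=10000000 ...)
--         Hash Cond: (o.customer_id = c.id)
--         ->  Seq Scan on orders o  (...)
--         ->  Hash  (...)
--               ->  Seq Scan on customers c  (...)
```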
The Optimization Challenge:
Optimizers don't find the theoretically optimal plan—that's often computationally infeasible. They find a 'good enough' plan quickly. A plan that's 10% slower but found in 1ms is better than the optimal plan found after 10 seconds of optimization.
Query optimization typically proceeds in stages, from fast heuristic transformations to more expensive cost-based searches.
Multi-Stage Optimization:

1. Rewrite phase: apply heuristic transformations that are almost always beneficial, such as predicate pushdown and expression simplification.
2. Plan enumeration: generate candidate physical plans, including join orders, join algorithms, and access methods.
3. Cost-based selection: estimate each candidate's cost from statistics and keep the cheapest plan.
Two Schools of Optimization:
| Approach | Description | Examples |
|---|---|---|
| Rule-Based (RBO) | Apply fixed rules based on query structure. No statistics needed. Simpler but less adaptive. | Oracle (legacy), older systems |
| Cost-Based (CBO) | Estimate costs using statistics, choose minimum-cost plan. More accurate but complex. | PostgreSQL, modern MySQL, SQL Server |
Modern systems are predominantly cost-based, but still apply heuristic rules for transformations that are universally beneficial.
Before cost-based optimization, the optimizer applies heuristic transformations—changes that are almost always beneficial regardless of data distribution.
Key Heuristic Rules:
- Selection pushdown: apply filter predicates as close to the base tables as possible, before joins (illustrated below).
- Predicate simplification: rewrite x > 5 AND x > 3 to x > 5, or 1 = 1 to TRUE.
- Constant folding: evaluate constant expressions once at planning time, so 5 + 3 becomes 8.
```sql
-- Original query
SELECT o.order_id, c.name
FROM orders o, customers c
WHERE o.customer_id = c.id
  AND c.region = 'WEST'
  AND o.year = 2024
  AND o.total > 1000;

-- After selection pushdown (conceptual):
-- Predicates pushed to their respective tables
--   Scan customers with: region = 'WEST'
--   Scan orders with:    year = 2024 AND total > 1000
--   Then join on:        customer_id = id

-- This reduces:
--   - Rows read from customers (only WEST)
--   - Rows read from orders (only 2024, >$1000)
--   - Join input sizes (fewer rows to join)

-- Before pushdown: join ALL rows, then filter
-- After pushdown: filter FIRST, then join (much smaller)

-- Example of constant folding and simplification
SELECT *
FROM products
WHERE price * 1.1 > 100        -- Rewritten to: price > 90.909...
  AND category = 'A'
  AND (active = true OR 1 = 0) -- Simplified to: active = true
  AND stock + 0 > 5;           -- Simplified to: stock > 5
```

Selection pushdown is the single most impactful heuristic. Filtering early can reduce data volumes by orders of magnitude. A predicate that eliminates 99% of rows should be applied as early as possible, not after expensive joins.
Cost-based optimization (CBO) is the heart of modern query optimization. The optimizer generates candidate plans, estimates each plan's cost, and selects the lowest-cost alternative.
The Cost Model:
Plan cost is typically measured in abstract units combining:
| Cost Component | What It Measures | Typical Weight |
|---|---|---|
| I/O Cost | Disk page reads/writes | Heavy (disk is slow) |
| CPU Cost | Tuple processing, comparisons | Moderate |
| Memory Cost | Buffer usage, sorting, hashing | Considered via algorithms |
| Network Cost | Data transfer (distributed DBs) | Heavy for distributed |
Most optimizers focus primarily on I/O cost, as disk access dominates runtime for data-intensive queries.
Cost Estimation Formulas (Simplified):
Sequential Scan Cost:
cost = (pages_in_table × seq_page_cost) + (rows × cpu_tuple_cost)
Index Scan Cost:
cost = (index_pages × random_page_cost) +
(matching_rows × index_tuple_cost) +
(heap_fetches × random_page_cost)
Nested Loop Join Cost:
cost = outer_cost + (outer_rows × inner_cost)
Hash Join Cost:
cost = outer_cost + inner_cost + (hash_build_cost) + (probe_cost)
Merge Join Cost:
cost = outer_cost + inner_cost + sort_cost_if_needed + merge_cost
These formulas require knowing row counts, page counts, and various tuning parameters.
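To make the sequential-scan formula concrete, here is a worked example using PostgreSQL's default cost parameters (seq_page_cost = 1.0, random_page_cost = 4.0, cpu_tuple_cost = 0.01; all are real, documented settings). The table sizes and index-scan figures are hypothetical, chosen only to illustrate the arithmetic:

```sql
-- Inspect the planner's cost parameters (PostgreSQL defaults shown)
SHOW seq_page_cost;     -- 1.0   (cost of one sequential page read)
SHOW random_page_cost;  -- 4.0   (cost of one random page read)
SHOW cpu_tuple_cost;    -- 0.01  (cost of processing one row)

-- Worked example: a table with 1,000 pages and 100,000 rows.
-- Sequential scan cost:
--   (1,000 pages × 1.0) + (100,000 rows × 0.01) = 1,000 + 1,000 = 2,000
--
-- An index scan fetching 50 scattered rows might instead cost roughly:
--   ~3 index pages × 4.0 + 50 heap fetches × 4.0 ≈ 212
-- which is why highly selective predicates favor the index,
-- while low-selectivity predicates favor the sequential scan.
```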
```sql
-- PostgreSQL cost estimates in EXPLAIN output
EXPLAIN SELECT * FROM orders WHERE customer_id = 12345;

/*
QUERY PLAN
-----------------------------------------------------------------------------
 Index Scan using orders_customer_id_idx on orders
   (cost=0.43..23.45 rows=12 width=84)
         ↑     ↑       ↑       ↑
         |     |       |       └─ Estimated row width (bytes)
         |     |       └─ Estimated rows returned
         |     └─ Total cost (to get all rows)
         └─ Startup cost (before first row)
   Index Cond: (customer_id = 12345)
*/

-- More complex query with cost breakdown
EXPLAIN (COSTS ON, FORMAT TEXT)
SELECT c.name, SUM(o.total)
FROM customers c JOIN orders o ON c.id = o.customer_id
WHERE c.region = 'WEST'
GROUP BY c.name;

/*
QUERY PLAN
---------------------------------------------------------------------------------------
 HashAggregate  (cost=4523.45..4528.45 rows=500 width=40)
   Group Key: c.name
   ->  Hash Join  (cost=125.00..4498.45 rows=5000 width=40)
         Hash Cond: (o.customer_id = c.id)
         ->  Seq Scan on orders o  (cost=0.00..3500.00 rows=100000 width=12)
         ->  Hash  (cost=100.00..100.00 rows=2000 width=36)
               ->  Seq Scan on customers c  (cost=0.00..100.00 rows=2000 width=36)
                     Filter: (region = 'WEST'::text)
*/
```

Cost-based optimization requires statistics about the data. Without knowing how many rows match a predicate, cost estimates are guesswork.
Key Statistics Maintained:
| Statistic | Description | Used For |
|---|---|---|
| Row Count (n) | Total rows in table | Base cardinality |
| Distinct Values (NDV) | Unique values per column | Selectivity estimation |
| NULL Fraction | Percentage of NULL values | IS NULL predicates |
| Min/Max Values | Column value range | Range predicates |
| Histograms | Value distribution buckets | Non-uniform distributions |
| Correlation | Physical order vs. logical order | Index scan efficiency |
| Average Width | Average bytes per column | Memory estimation |
| Page Count | Disk pages used | I/O cost estimation |
Cardinality Estimation:
The optimizer must estimate how many rows each operation produces. This is cardinality estimation, and it's notoriously difficult.
Selectivity Formulas:
Equality predicate (column = value):
selectivity = 1 / distinct_values (if uniform distribution)
Range predicate (column > value):
selectivity = (max - value) / (max - min)
LIKE prefix (column LIKE 'abc%'):
selectivity = 1 / (distinct_values^0.3) (heuristic)
AND combination:
selectivity = sel1 × sel2 (assumes independence)
OR combination:
selectivity = sel1 + sel2 - (sel1 × sel2)
Join result:
cardinality = (rows1 × rows2) / max(distinct1, distinct2)
The independence assumption (AND) is often wrong—real data has correlations. This is a major source of estimation errors.
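A quick worked example shows how these formulas compose, and how the independence assumption goes wrong. The row counts and distinct-value counts below are hypothetical, chosen for illustration:

```sql
-- Hypothetical statistics: orders has 10,000,000 rows;
-- region has 4 distinct values, year has 10.

-- Equality: selectivity(region = 'WEST') = 1/4  = 0.25
-- Equality: selectivity(year = 2024)     = 1/10 = 0.10

-- AND combination (independence assumed):
--   selectivity = 0.25 × 0.10 = 0.025
--   estimate    = 10,000,000 × 0.025 = 250,000 rows

-- But if WEST orders are concentrated in recent years (correlated
-- columns), the true count could be several times higher. An
-- underestimate like that can flip the plan from a hash join to a
-- nested loop that performs terribly at the actual row count.
```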
```sql
-- View table statistics
SELECT relname, reltuples, relpages
FROM pg_class
WHERE relname = 'orders';

-- View column statistics
SELECT attname, n_distinct, null_frac, avg_width, correlation,
       most_common_vals, most_common_freqs, histogram_bounds
FROM pg_stats
WHERE tablename = 'orders' AND attname = 'status';

-- Update statistics (analyze table)
ANALYZE orders;

-- Analyze specific columns only
ANALYZE orders (customer_id, order_date, status);

-- Create extended statistics for correlated columns
CREATE STATISTICS orders_customer_date (dependencies)
ON customer_id, order_date FROM orders;

-- Make auto-analyze run more often on this table (lower scale factor)
ALTER TABLE orders SET (autovacuum_analyze_scale_factor = 0.01);
```

When statistics become stale (after many inserts/updates/deletes), the optimizer makes poor decisions based on outdated information. Regular ANALYZE operations are essential for production systems. Many databases auto-analyze, but may not do so frequently enough for rapidly changing tables.
The optimizer faces a combinatorial explosion: the number of possible execution plans grows exponentially with query complexity.
Join Order Explosion:
For a query joining n tables, the number of possible join orders is:
| Tables (n) | Join Orders | Approximate Count |
|---|---|---|
| 3 | 3! = 6 | 6 |
| 5 | 5! = 120 | 120 |
| 10 | 10! = 3.6M | 3,628,800 |
| 15 | 15! | 1.3 trillion |
| 20 | 20! | 2.4 × 10^18 |
For each join order, we must also choose join algorithms, access methods, and parallelism levels. The complete search space is astronomical.
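PostgreSQL happens to expose a factorial() function, so the growth shown in the table above can be reproduced with a one-line query:

```sql
-- Join-order count (left-deep permutations) for 3 to 20 tables
SELECT n, factorial(n) AS join_orders
FROM generate_series(3, 20) AS n;
```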
Search Strategies:
Optimizers use various techniques to explore this space efficiently:
| Strategy | Description | Pros/Cons |
|---|---|---|
| Exhaustive Search | Examine all plans | Optimal but infeasible for large queries |
| Dynamic Programming | Build optimal subplans, combine them | Good for left-deep trees, manageable space |
| Greedy Heuristics | Make locally optimal choices | Fast but may miss global optimum |
| Randomized Search | Random exploration with local optimization | Good for very large queries |
| Genetic Algorithms | Evolutionary approach to plan search | Used in PostgreSQL for many-table queries |
```
// Selinger-style dynamic programming optimizer (System R)
// Finds optimal left-deep join trees

function find_best_plan(tables):
    // Phase 1: Find best single-table access path for each table
    for each table T in tables:
        for each access method A (seq scan, index scan, etc.):
            cost = estimate_cost(A, T)
            if cost < best_cost[{T}]:
                best_plan[{T}] = create_plan(A, T)
                best_cost[{T}] = cost

    // Phase 2: Build up join plans incrementally
    for size = 2 to |tables|:
        for each subset S of tables with |S| = size:
            for each way to partition S into (R, {T}) where T is a single table:
                if join_predicate_exists(R, T):
                    for each join method J (nested loop, hash, merge):
                        left_plan = best_plan[R]
                        cost = best_cost[R] + join_cost(J, left_plan, T)
                        if cost < best_cost[S]:
                            best_plan[S] = create_join_plan(J, left_plan, T)
                            best_cost[S] = cost

    return best_plan[tables]

// Complexity: O(3^n) subset/partition combinations in the general case -
// still exponential but manageable. For n = 10: ~59,000 combinations to
// consider (much less than 10! = 3.6M join orders)
```

PostgreSQL uses dynamic programming for queries with up to 12 tables (configurable via geqo_threshold). Beyond that, it switches to GEQO (Genetic Query Optimizer), which uses genetic algorithms to search the plan space. This trades optimality for speed on very complex queries.
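These thresholds are exposed as ordinary settings in PostgreSQL, so you can inspect or adjust them per session:

```sql
-- Inspect the GEQO settings (PostgreSQL)
SHOW geqo;            -- on: genetic optimizer enabled
SHOW geqo_threshold;  -- default 12: use GEQO at this many FROM items or more

-- Raise the threshold so exhaustive DP is used for larger queries
-- (plans may improve, but planning time grows exponentially)
SET geqo_threshold = 14;
RESET geqo_threshold;
```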
Logical operators (join, select, project) must be mapped to physical operators—specific algorithms that implement the logical operation.
Access Method Selection:
| Logical Operation | Physical Options | Selection Criteria |
|---|---|---|
| Table Scan | Sequential Scan, Index Scan, Index-Only Scan, Bitmap Scan | Selectivity, index availability, clustering |
| Join | Nested Loop, Hash Join, Merge Join | Table sizes, join type, available memory, indexes |
| Aggregation | Plain Agg, Sorted Agg, Hash Agg | Distinct values, available memory, sort order |
| Sort | Quicksort, External Merge Sort, Index Scan | Data size, memory, existing sort order |
| Distinct | Sort + Unique, Hash Unique | Distinct count, memory, input ordering |
Join Algorithm Selection Heuristics:
| Join Type | Best When... |
|---|---|
| Nested Loop | Inner side is small or indexed; outer is highly selective |
| Hash Join | No useful indexes; equi-join; one side fits in memory |
| Merge Join | Inputs already sorted; equi-join; sort can be reused |
Index Selection:
When multiple indexes exist, the optimizer evaluates which (if any) to use:
```sql
-- Table 'orders' has indexes on:
--   - order_id (primary key)
--   - customer_id
--   - order_date
--   - (customer_id, order_date) composite

SELECT * FROM orders WHERE customer_id = 100;
-- Uses: customer_id index (direct match)

SELECT * FROM orders WHERE order_date = '2024-01-15';
-- Uses: order_date index (direct match)

SELECT * FROM orders WHERE customer_id = 100 AND order_date = '2024-01-15';
-- Uses: composite index (matches both columns)

SELECT * FROM orders WHERE order_date > '2024-01-01';
-- May use: sequential scan (if range is large)
-- Or: order_date index (if range is small)
```
```sql
-- Compare physical operator choices with forced methods

-- Default (optimizer chooses)
EXPLAIN SELECT * FROM orders o JOIN customers c ON o.customer_id = c.id;

-- Force nested loop
SET enable_hashjoin = off;
SET enable_mergejoin = off;
EXPLAIN SELECT * FROM orders o JOIN customers c ON o.customer_id = c.id;

-- Force hash join
SET enable_hashjoin = on;
SET enable_mergejoin = off;
SET enable_nestloop = off;
EXPLAIN SELECT * FROM orders o JOIN customers c ON o.customer_id = c.id;

-- Reset to defaults
RESET enable_hashjoin;
RESET enable_mergejoin;
RESET enable_nestloop;

-- Force a specific scan method
SET enable_seqscan = off;  -- forces index usage if possible
EXPLAIN SELECT * FROM orders WHERE customer_id = 12345;
RESET enable_seqscan;
```

Optimization is expensive. Databases cache execution plans to avoid re-optimizing identical or similar queries.
Plan Cache Strategies:

Approaches vary by system: PostgreSQL caches plans per prepared statement within a session, while systems such as SQL Server and Oracle maintain a shared, cross-session plan cache keyed by normalized query text.
The Generic Plan Problem:
Prepared statements face a dilemma:
```sql
-- Prepared statement
PREPARE find_orders(text) AS
  SELECT * FROM orders WHERE status = $1;

-- With status = 'pending'   (rare,   0.1% of rows): Index scan is best
-- With status = 'completed' (common, 80%  of rows): Seq scan is best

-- A generic plan must choose ONE approach.
-- PostgreSQL uses custom plans for the first 5 executions, then evaluates
-- whether a generic plan is acceptable.
```
```sql
-- View prepared statement plan (after several executions)
PREPARE get_orders(int) AS
  SELECT * FROM orders WHERE customer_id = $1;

-- Execute multiple times to build plan statistics
EXECUTE get_orders(100);
EXECUTE get_orders(200);
EXECUTE get_orders(300);

-- View cached plan info
SELECT name, statement, generic_plans, custom_plans
FROM pg_prepared_statements;

-- Force generic plan behavior
SET plan_cache_mode = 'force_generic_plan';

-- Force custom plan behavior
SET plan_cache_mode = 'force_custom_plan';

-- Reset to auto
SET plan_cache_mode = 'auto';
```

Parameter sniffing occurs when the first execution's parameter values produce a plan that's suboptimal for subsequent values. If the first call uses status = 'pending' (rare), the cached plan uses an index scan—but then status = 'completed' (common) also gets an index scan, causing thousands of random I/Os instead of a fast sequential scan.
We've explored query optimization comprehensively. Let's consolidate the key concepts:

- Equivalent plans can differ in runtime by orders of magnitude; the optimizer's job is to find a good plan quickly, not the perfect plan slowly.
- Heuristic rewrites (selection pushdown, predicate simplification, constant folding) are applied first because they are almost always beneficial.
- Cost-based optimization estimates I/O, CPU, and (in distributed systems) network cost for candidate plans and picks the cheapest.
- Statistics and cardinality estimation drive everything; stale statistics and the independence assumption are major sources of bad plans.
- The plan space grows exponentially with join count, so optimizers rely on dynamic programming, greedy heuristics, or genetic search.
- Plan caching avoids repeated optimization but introduces the generic-plan / parameter-sniffing trade-off.
What's Next:
With an optimized execution plan in hand, we're ready for the final phase: execution. The next page examines how the query executor actually processes data—the iterator model, pipelining, materialization, and the machinery that transforms plans into results.
You now understand how query optimizers transform logical plans into efficient physical plans through heuristic and cost-based techniques. This foundation prepares you for understanding query execution—where plans become results.