The SQL query optimizer is often considered the most sophisticated component of a database management system—and for good reason. When you write a SQL query, you're specifying what data you want, not how to retrieve it. The optimizer's job is to figure out the how, and to do it better than most humans could manually.
Consider a query joining five tables with multiple filter conditions. The number of possible execution strategies is astronomical—different join orders, different join algorithms, different access paths for each table. A naive approach might examine every row in every table; the optimal approach might use indexes to examine only a handful of rows. The difference could be seconds versus hours.
Query optimization is the process of transforming a parsed query into an efficient execution plan by:

- Applying logical transformations that rewrite the query into an equivalent but more optimizable form
- Choosing an access path (full scan, index scan, and so on) for each table
- Selecting a join order and a join algorithm for each join
- Estimating the cost of candidate plans from statistics and picking the cheapest one
This page explores the optimizer's architecture, the strategies it employs, and how understanding optimization helps you write queries that perform brilliantly at scale.
By the end of this page, you'll understand how cost-based optimizers work, what statistics they rely on, how they evaluate join orders and access paths, and what transformations they apply to improve query performance. You'll gain insight into why seemingly similar queries can have vastly different performance characteristics.
Modern query optimizers are complex pieces of engineering that have evolved over decades of database research. Most production systems use a cost-based optimizer (CBO) that estimates the computational cost of different execution strategies and selects the cheapest one.
The optimization process takes the validated parse tree from the parsing phase and produces an execution plan—a detailed specification of exactly how the query should be executed.
Key Optimizer Components:

- Query transformer: applies rule-based logical rewrites such as predicate pushdown and subquery unnesting
- Plan generator: enumerates candidate access paths, join orders, and join algorithms
- Cost estimator: predicts the cost of each candidate plan using a cost model
- Statistics: row counts, distinct-value counts, and histograms that feed the cost estimates
Two Optimization Paradigms:
1. Rule-Based Optimization (RBO): Uses a fixed set of heuristic rules ranked by priority. First applicable rule wins. Simple but inflexible—doesn't adapt to data characteristics.
2. Cost-Based Optimization (CBO): Estimates actual execution costs using statistics. More accurate but computationally intensive. All modern databases primarily use CBO.
Most systems combine both: rules handle obvious transformations (predicate pushdown almost always helps), while CBO handles decisions that depend on the data (which join order is best).
The optimizer makes decisions based on estimates, which can be wrong. Stale statistics, skewed data distributions, and correlated columns can all mislead the optimizer. Understanding how the optimizer works helps you recognize when it's making suboptimal choices and how to guide it toward better plans.
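One practical way to spot a misled optimizer is to compare its row estimates against reality. The sketch below uses PostgreSQL's EXPLAIN ANALYZE against the orders/customers tables used as a running example on this page; note that EXPLAIN ANALYZE actually executes the query. Other systems offer equivalents such as Oracle's DBMS_XPLAN or SQL Server's actual execution plans.

```sql
-- Compare the planner's estimates ("rows=") with reality ("actual rows=").
-- Large gaps usually mean stale statistics, skew, or correlated columns.
EXPLAIN ANALYZE
SELECT o.order_id, c.name
FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE c.country = 'USA';
```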
Before exploring physical execution strategies, the optimizer applies logical transformations that produce semantically equivalent queries but may have better optimization potential. These transformations are based on algebraic equivalences in relational algebra.
| Transformation | Description | Example |
|---|---|---|
| Predicate Pushdown | Move filter conditions closer to base tables | JOIN ... WHERE a.x=1 → Apply x=1 before join |
| Projection Pushdown | Eliminate unneeded columns early | Remove columns not in SELECT or JOIN conditions |
| Subquery Unnesting | Convert correlated subquery to join | WHERE id IN (subquery) → Semi-join |
| Join Reordering | Rearrange join order for efficiency | Join smallest results first |
| Constant Folding | Evaluate constant expressions | WHERE date > '2024-01-01' + 30 → 2024-01-31 |
| Predicate Simplification | Remove redundant predicates | WHERE x>5 AND x>3 → WHERE x>5 |
| View Merging | Merge view definition into outer query | Allow optimization across view boundary |
| Common Subexpression Elimination | Compute repeated expressions once | Factor out repeated calculations |
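As a concrete illustration of the view-merging transformation from the table above, consider a filtered view (illustrative names; the exact behavior depends on your optimizer):

```sql
-- A view that hides a filter.
CREATE VIEW usa_customers AS
SELECT id, name, country
FROM customers
WHERE country = 'USA';

-- A query against the view. With view merging, the optimizer expands the
-- view definition into the outer query so it optimizes a single block:
--   SELECT id, name FROM customers WHERE country = 'USA' AND name LIKE 'A%'
-- which lets it consider, say, an index on (country, name) for both predicates.
SELECT id, name
FROM usa_customers
WHERE name LIKE 'A%';
```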
Predicate Pushdown in Detail:
Predicate pushdown is one of the most impactful optimizations. By filtering rows early, we reduce the data flowing through expensive operations like joins.
```sql
-- Original query
SELECT o.order_id, c.name, o.total
FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE c.country = 'USA' AND o.total > 1000;

-- Without pushdown (inefficient):
-- 1. Join ALL orders with ALL customers
-- 2. Then filter for country='USA' and total>1000
-- Result: Massive intermediate join result

-- With pushdown (optimized):
-- 1. Filter customers where country='USA'
-- 2. Filter orders where total>1000
-- 3. Join the filtered results
-- Result: Much smaller intermediate data

-- Internally transformed to:
SELECT o.order_id, c.name, o.total
FROM (SELECT * FROM orders WHERE total > 1000) o
JOIN (SELECT * FROM customers WHERE country = 'USA') c
  ON o.customer_id = c.id;
```

Subquery Unnesting:
Correlated subqueries can be extremely expensive because they conceptually execute once per row of the outer query. Unnesting transforms them into joins, which can use efficient join algorithms.
```sql
-- Correlated subquery (potentially slow)
SELECT *
FROM employees e
WHERE e.salary > (
  SELECT AVG(salary)
  FROM employees e2
  WHERE e2.department_id = e.department_id
);
-- Conceptually: For each employee, compute the department average
-- O(n²) behavior if executed naively

-- After unnesting (optimized)
SELECT e.*
FROM employees e
JOIN (
  SELECT department_id, AVG(salary) AS avg_sal
  FROM employees
  GROUP BY department_id
) dept_avg ON e.department_id = dept_avg.department_id
WHERE e.salary > dept_avg.avg_sal;
-- Compute averages once, then join
-- O(n) behavior
```

Not all transformations always apply. Outer joins restrict predicate pushdown: a condition in a LEFT JOIN's ON clause can't be pushed down as a filter on the preserved side, and moving a condition between ON and WHERE changes the query's meaning (see the sketch below). EXISTS subqueries unnest differently than IN subqueries. The optimizer has complex rules about when each transformation is valid and beneficial.
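To make the outer-join caveat concrete, here is a small sketch (illustrative tables) of why predicate placement matters and the optimizer cannot move the condition freely:

```sql
-- Filter in the WHERE clause: customers with no qualifying orders are
-- removed, because their NULL-extended o.total fails the comparison.
SELECT c.id, o.total
FROM customers c
LEFT JOIN orders o ON o.customer_id = c.id
WHERE o.total > 1000;

-- Filter in the ON clause: every customer is preserved; the condition only
-- decides which orders match. The two queries return different results, so
-- the optimizer may not push the predicate across the outer join.
SELECT c.id, o.total
FROM customers c
LEFT JOIN orders o ON o.customer_id = c.id AND o.total > 1000;
```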
For each table in the query, the optimizer must choose an access path—the method used to retrieve rows from that table. The choice dramatically affects performance.
Available Access Paths:
| Access Path | Description | Best When | Cost Profile |
|---|---|---|---|
| Full Table Scan | Read every row in the table sequentially | Selecting large fraction of rows; no useful index | O(n) pages read, sequential I/O |
| Index Scan | Navigate B-tree index to find matching rows | Selective predicates on indexed columns | O(log n + k) where k = matching rows |
| Index-Only Scan | Retrieve all needed columns from index | All columns in query are in the index | Very fast; avoids table access |
| Bitmap Index Scan | Combine multiple indexes via bitmaps | Multiple predicates, each partially selective | Efficient for complex WHERE clauses |
| Index Skip Scan | Probe the index once per distinct value of the leading column(s) | Predicates on non-leading index columns | Alternative to full scan when no direct index |
The Index Selection Decision:
Choosing between a full table scan and an index scan depends on selectivity—what fraction of rows the query needs.
Why Full Scans Sometimes Win:
It's counterintuitive, but full table scans can outperform index scans when:
Selectivity is low: If 50% of orders are 'pending', the index doesn't help much—we'll visit most of the table anyway, but with random I/O instead of sequential.
Table is small: For tiny tables that fit in a few pages, scanning is faster than navigating B-tree overhead.
Index is unclustered: If matching rows are scattered across the table, each index lookup causes a random disk seek. Sequential scanning can be 100x faster per page.
Covering queries: A full scan of a narrow table might beat index lookups into a wide table if we only need a few columns.
```sql
-- Index scan: highly selective predicate
SELECT * FROM customers WHERE email = 'specific@email.com';
-- Uses index on email: 1 row out of millions

-- Full table scan: low selectivity
SELECT * FROM orders WHERE YEAR(order_date) = 2024;
-- Most orders are in 2024; scan is faster
-- Also: function on column prevents index use!

-- Index-only scan: all columns in index
SELECT customer_id, COUNT(*) FROM orders GROUP BY customer_id;
-- If index on (customer_id) covers the query

-- Bitmap index scan: multiple conditions
SELECT * FROM products
WHERE category = 'Electronics' AND price < 100 AND in_stock = true;
-- Combine bitmaps from category, price, in_stock indexes
```

Applying functions to indexed columns typically prevents index use. WHERE UPPER(name) = 'JOHN' can't use an index on name. Solutions: use expression/functional indexes, or rewrite predicates (WHERE name = 'John' OR name = 'john' ...).
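If you do need to filter on a function of a column, one option (shown here in PostgreSQL syntax, with illustrative names) is an expression index that matches the predicate exactly:

```sql
-- Index the expression itself so WHERE UPPER(name) = ... becomes indexable.
CREATE INDEX idx_customers_upper_name ON customers (UPPER(name));

-- This predicate can now use the expression index, because it matches
-- the indexed expression exactly.
SELECT * FROM customers WHERE UPPER(name) = 'JOHN';
```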
Joins are often the most expensive operations in SQL queries, and optimizing them is a major focus of the query optimizer. Two key decisions must be made:
Join Ordering:
For n tables, there are n! possible join orders—5 tables have 120 orderings; 10 tables have 3.6 million. Finding the optimal order is NP-hard, so optimizers use heuristics and dynamic programming to explore promising orders without exhaustive search.
Join Algorithms:
The optimizer chooses from several join algorithms, each suited to different scenarios:
| Algorithm | How It Works | Best When | Complexity |
|---|---|---|---|
| Nested Loop Join | For each row in outer table, scan inner table | Small outer table, indexed inner table | O(n × m), but can use index |
| Index Nested Loop | Nested loop with index lookup on inner table | Selective outer, good index on inner | O(n × log m) with index |
| Hash Join | Build hash table on smaller table, probe with larger | Large tables, no useful indexes, equi-join | O(n + m) with enough memory |
| Sort-Merge Join | Sort both tables, merge sorted streams | Both tables already sorted, or need sorted output | O(n log n + m log m) |
| Bitmap Join | Use bitmap for matching keys | Star schema, dimension-fact joins | Efficient for multi-way joins |
Choosing the Right Algorithm:
The optimizer considers:

- The estimated size of each input and which side is smaller
- Whether a useful index exists on the inner table's join column
- Whether the join condition is an equality (hash joins require equi-joins)
- How much working memory is available for hash tables and sorts
- Whether the inputs are already sorted, or sorted output is needed anyway
```sql
-- Scenario 1: Index nested loop optimal
-- Small outer (filtered customers), indexed inner (orders)
SELECT c.name, o.order_date, o.total
FROM customers c
JOIN orders o ON o.customer_id = c.id
WHERE c.status = 'VIP';
-- Few VIP customers; use index on orders.customer_id

-- Scenario 2: Hash join optimal
-- Large tables, no selective filters, no covering index
SELECT *
FROM large_table_a a
JOIN large_table_b b ON a.key = b.key;
-- Build hash on smaller table, probe with larger

-- Scenario 3: Sort-merge join optimal
-- Both tables already sorted by join key
-- Or query needs ORDER BY on join key anyway
SELECT *
FROM sorted_events_a a
JOIN sorted_events_b b ON a.event_time = b.event_time
ORDER BY a.event_time;
-- Merge sorted streams; output is already sorted
```

Joining orders (1M rows) with customers (100K rows) then with countries (200 rows) is very different from joining countries with customers then with orders. The optimizer tries to minimize intermediate result sizes by joining the most selective tables first.
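When you suspect the optimizer picked the wrong join algorithm, many systems let you compare alternatives side by side. A PostgreSQL-flavored sketch (the enable_* planner switches are real session settings intended for experimentation, not production; the table names reuse the example above):

```sql
-- See which join algorithm the planner chose and its estimated cost.
EXPLAIN
SELECT *
FROM large_table_a a
JOIN large_table_b b ON a.key = b.key;

-- Temporarily discourage hash joins in this session to see the
-- planner's next-best alternative and its estimated cost.
SET enable_hashjoin = off;
EXPLAIN
SELECT *
FROM large_table_a a
JOIN large_table_b b ON a.key = b.key;
RESET enable_hashjoin;
```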
At the heart of cost-based optimization is the cost model—a way to estimate how expensive each execution plan will be. The optimizer doesn't run each plan and measure; it estimates costs mathematically based on statistics.
What Does 'Cost' Mean?
Cost is typically measured in abstract units that correlate with execution time. Different databases define cost differently, but common components include:
| Cost Factor | What It Measures | Typical Weight |
|---|---|---|
| Sequential I/O | Pages read sequentially from disk | 1.0 (baseline) |
| Random I/O | Random disk seeks | 4.0 (4x sequential) |
| CPU Processing | Tuple processing, function evaluation | 0.01 per tuple |
| Network I/O | Data transfer for distributed queries | Varies by system |
| Memory Usage | Working memory for sorts, hashes | Implicit in algorithm choice |
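The weights above are illustrative; each system exposes its own constants. In PostgreSQL, for example, the planner's page and tuple cost factors are ordinary settings that can be inspected and, carefully, tuned. A minimal sketch:

```sql
-- PostgreSQL cost-model constants (real settings; the values in the
-- comments are the shipped defaults and may differ in your installation).
SELECT name, setting
FROM pg_settings
WHERE name IN ('seq_page_cost',       -- sequential page read, 1.0
               'random_page_cost',    -- random page read, 4.0
               'cpu_tuple_cost',      -- per-row CPU cost, 0.01
               'effective_cache_size');
```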
Cardinality Estimation:
The most critical input to cost estimation is cardinality—how many rows will each operation produce? Accurate cardinality estimates are essential because:

- Every operator's cost is a function of how many rows it must process
- Estimation errors compound: a misestimate at the bottom of a join tree propagates, and often multiplies, through every operator above it
- Join order and join algorithm choices hinge directly on the estimated input sizes
How Cardinality Is Estimated:

- Base statistics: table row counts and the number of distinct values per column give a uniform-distribution estimate (rows / distinct values)
- Histograms and most-common-value lists refine that estimate when data is skewed
- Compound predicates are usually assumed independent, so their selectivities are multiplied, which breaks down for correlated columns
```sql
-- Example: Estimating cardinality for a filter
-- Table: orders (1,000,000 rows)
-- Column: status has 5 distinct values

SELECT * FROM orders WHERE status = 'pending';
-- Estimated cardinality: 1,000,000 / 5 = 200,000 rows
-- (assuming uniform distribution)

-- But what if 80% of orders are 'completed'?
-- Without histogram: estimate = 200,000 (WRONG)
-- With histogram: estimate accurately reflects actual distribution

-- Compound predicates (assumption of independence)
SELECT * FROM orders
WHERE status = 'pending' AND region = 'North America';
-- If status has 5 values, region has 10 values:
-- Estimated: 1,000,000 × (1/5) × (1/10) = 20,000 rows
-- But if 'pending' orders are mostly in 'North America',
-- actual might be 150,000 rows (correlation not captured)
```

When the optimizer underestimates cardinality, it might choose nested loops over hash join—disastrous for large tables. When it overestimates, it might scan when an index would help. Keep statistics updated with regular ANALYZE/UPDATE STATISTICS commands.
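Some databases can capture exactly the kind of column correlation described above. PostgreSQL 10+, for instance, supports extended statistics; a sketch using the example orders table:

```sql
-- Ask the planner to track functional dependencies between the columns.
CREATE STATISTICS orders_status_region (dependencies)
  ON status, region FROM orders;

-- Statistics objects are populated by the next ANALYZE.
ANALYZE orders;

-- Combined predicates on status and region are now estimated using the
-- measured dependency instead of the independence assumption.
EXPLAIN
SELECT * FROM orders
WHERE status = 'pending' AND region = 'North America';
```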
Statistics Maintenance:
Statistics must be kept current as data changes. Stale statistics lead to bad estimates and bad plans.
| Database | Command | Notes |
|---|---|---|
| PostgreSQL | ANALYZE table_name; | Can be automatic (autovacuum) |
| MySQL | ANALYZE TABLE table_name; | For InnoDB, updates index statistics |
| Oracle | DBMS_STATS.GATHER_TABLE_STATS() | Rich options for sampling, histograms |
| SQL Server | UPDATE STATISTICS table_name; | Auto-update on significant changes |
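To confirm that statistics are actually current, most systems expose metadata about the last collection. A PostgreSQL-flavored sketch (pg_stat_user_tables is a real view; the table name is illustrative):

```sql
-- Refresh statistics for one table, then check when it was last analyzed
-- and how much has changed since.
ANALYZE orders;

SELECT relname,
       last_analyze,          -- last manual ANALYZE
       last_autoanalyze,      -- last automatic ANALYZE by autovacuum
       n_live_tup,            -- estimated live rows
       n_mod_since_analyze    -- rows modified since statistics were gathered
FROM pg_stat_user_tables
WHERE relname = 'orders';
```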
Given the enormous search space of possible plans, optimizers use sophisticated algorithms to find good plans without exhaustive search.
Dynamic Programming (System R Approach):
The classic approach, pioneered by IBM's System R, uses bottom-up dynamic programming:

1. Find the cheapest access path for each individual table.
2. For every pair of tables, find the cheapest join, considering each join algorithm and the best single-table plans.
3. Extend to three tables, then four, reusing the best smaller sub-plans and discarding more expensive alternatives (while keeping plans whose sort order might help later, the so-called interesting orders).
4. Continue until the cheapest plan covering all tables is found.
This finds the optimal plan within the space it enumerates (System R itself restricted the search to left-deep join trees; later optimizers extend the same idea to bushy trees), but the search remains exponential in the number of tables.
Other Search Strategies:

- Greedy heuristics: repeatedly join the pair with the smallest estimated result; fast, but can miss better orders
- Randomized search (e.g., simulated annealing): sample the plan space and keep the best plan found within a budget
- Genetic algorithms: PostgreSQL's GEQO takes over when a query joins more tables than geqo_threshold
- Top-down, transformation-based search: the Cascades framework (used by SQL Server) applies rewrite rules with memoization and pruning
For 15 tables, there are approximately 10^12 possible join orderings. Even at nanoseconds per plan, exhaustive search would take years. Optimizers must balance plan quality against optimization time—spending 5 seconds optimizing a query that runs for 10 minutes is worthwhile; spending 5 seconds on a query that runs for 1 millisecond is not.
Optimization Time Limits:
Most optimizers have configurable time budgets. When the budget is exhausted, they return the best plan found so far. This is why complex queries sometimes produce suboptimal plans—the optimizer ran out of time to explore better alternatives.
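These budgets and search limits are often configurable. In PostgreSQL, for example, a few real settings control how far the planner searches; a sketch:

```sql
-- Inspect the planner's search-space limits (defaults noted in comments;
-- your installation may differ).
SELECT name, setting
FROM pg_settings
WHERE name IN ('join_collapse_limit',   -- default 8: max tables reordered exhaustively
               'from_collapse_limit',   -- default 8: max subqueries merged into the FROM list
               'geqo_threshold');       -- default 12: use genetic search above this

-- Raising the limit for a session lets the planner consider more join
-- orders, trading longer planning time for potentially better plans.
SET join_collapse_limit = 12;
```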
Traditional optimization happens entirely before execution—the plan is fixed based on estimates. Adaptive query optimization allows plans to adjust during execution based on actual runtime information.
Why Adaptive Optimization Matters:

- Estimates can be badly wrong: stale statistics, skewed data, correlated columns, and parameter-sensitive queries all mislead the cost model
- A plan that is optimal for the estimated cardinalities can be disastrous for the actual ones
- Runtime feedback lets the system detect the mismatch and correct course instead of finishing a bad plan
| Technique | How It Works | Database Support |
|---|---|---|
| Adaptive Joins | Start with nested loop, switch to hash if cardinality exceeds threshold | Oracle 12c+ |
| Adaptive Statistics | Gather missing statistics during execution for future queries | Oracle 12c+ |
| Adaptive Cursor Sharing | Create multiple plan variants for different parameter values | Oracle 11g+ |
| Mid-execution Reoptimization | Pause execution, reoptimize with actual cardinalities | IBM DB2 |
| Progressive Query Processing | Produce approximate results quickly, refine over time | Research systems |
Adaptive Joins in Oracle:
Oracle's adaptive join planning works like this:

1. The optimizer produces a default plan (say, a nested loop join) plus an alternative subplan (a hash join), with a statistics collector placed above the join's first input.
2. During execution, the collector buffers and counts incoming rows up to a precomputed inflection threshold.
3. If the row count stays below the threshold, the nested loop proceeds; if it exceeds the threshold, execution switches to the hash join subplan.
4. The chosen subplan is then fixed for the remainder of that execution, and the unused alternative is discarded.
Some adaptive optimization features are disabled by default because of their runtime overhead, while others ship enabled. Check your database documentation for parameters such as OPTIMIZER_ADAPTIVE_PLANS and OPTIMIZER_ADAPTIVE_STATISTICS (Oracle), or similar settings in your DBMS.
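To see how these features are configured on a given instance, you can query the parameter views directly. An Oracle-flavored sketch (V$PARAMETER is a real view; it requires the appropriate privileges):

```sql
-- Oracle 12.2+ splits the adaptive features across two parameters:
-- adaptive plans (on by default) and adaptive statistics (off by default).
SELECT name, value
FROM v$parameter
WHERE name IN ('optimizer_adaptive_plans',
               'optimizer_adaptive_statistics');
```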
Query optimization is the intelligence that transforms your declarative SQL into efficient execution strategies. Let's consolidate the key concepts:

- Logical transformations (predicate pushdown, subquery unnesting, view merging) rewrite the query into equivalent but more optimizable forms
- Access path selection weighs index scans against full scans based on selectivity
- Join order and join algorithm choices dominate the cost of multi-table queries
- Cost-based decisions rest on statistics and cardinality estimates, which is why stale or missing statistics produce bad plans
- The plan search space is enormous, so optimizers combine dynamic programming, heuristics, and time budgets
- Adaptive techniques let some systems adjust plans at runtime when estimates prove wrong
What's Next:
With optimization complete, the optimizer produces an execution plan—a detailed recipe for how to execute the query. The next page examines execution plans in detail: how to read them, what operators they contain, and how to use them for performance troubleshooting.
You now understand how SQL query optimizers work—from logical transformations through cost estimation to plan selection. This knowledge is essential for writing queries that perform well and diagnosing performance problems when they occur. Next, we'll learn to read and interpret the execution plans the optimizer produces.