Every query optimizer maintains a playbook—a collection of battle-tested strategies accumulated over decades of database system development. These strategies, known as heuristics, are transformations that improve query performance in the vast majority of cases without requiring detailed knowledge about data distributions.
Unlike cost-based decisions that weigh alternatives using statistics, heuristics encode universal wisdom: reduce data volume early, minimize expensive operations, and structure computations to exploit available optimizations. These principles transcend any particular database system or workload.
By the end of this page, you will understand the major categories of heuristic optimizations, the specific transformations within each category, why these heuristics work in practice, and how to recognize when they can dramatically improve query performance. You'll develop intuition for the core principles that guide query optimization.
Not all heuristics are equal in their impact or applicability. They form a hierarchy based on how universally beneficial they are and how large their typical performance improvements are.
Heuristics can be organized into tiers by typical impact: near-universal wins such as selection and projection pushdown sit at the top and are applied unconditionally, while more situational rewrites pay off only under the right conditions and often need cost-based confirmation.
Underlying all specific transformation rules are a few meta-heuristics—abstract principles that guide optimizer design:
| Meta-Heuristic | Principle | Manifestations |
|---|---|---|
| Reduce Early | Minimize data volume as early as possible in the plan | Selection pushdown, projection pushdown, early aggregation |
| Avoid Redundancy | Never compute the same result twice | Common subexpression elimination, view caching, memoization |
| Prefer Pipeline | Keep data flowing rather than materializing intermediate results | Avoid sorts when possible, prefer streaming joins |
| Exploit Order | Use existing sort order instead of creating new ones | Index scan exploitation, merge join when data is pre-sorted |
| Minimize I/O | Disk access dominates cost; design plans to minimize it | Index-only scans, covering indexes, sequential access patterns |
When analyzing a query plan or designing optimizations, always ask: 'Am I reducing early? Avoiding redundancy? Exploiting available order?' These questions reveal optimization opportunities that specific rules might miss.
The most impactful heuristics focus on reducing the amount of data that flows through the query plan. Less data means less memory pressure, fewer disk I/Os, and faster execution at every subsequent stage.
The principle is simple: if you're going to filter out rows, do it as early as possible. Every row filtered early is a row that doesn't need to be joined, grouped, or sorted.
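The effect is easy to demonstrate outside a database. Below is a toy Python sketch (synthetic tables and invented row counts, not any engine's implementation) that counts how many rows reach the join under each strategy:

```python
# Toy model of selection pushdown: synthetic tables, invented row counts.
orders = [{"order_id": i, "customer_id": i % 100,
           "order_date": "2023-06-01" if i % 2 else "2024-03-01"}
          for i in range(1000)]
customers = [{"id": i, "country": "USA" if i < 10 else "DE"} for i in range(100)]

def join_then_filter():
    # join everything, then filter (the unoptimized plan)
    joined = [(o, c) for o in orders for c in customers
              if o["customer_id"] == c["id"]]
    kept = [(o, c) for o, c in joined
            if c["country"] == "USA" and o["order_date"] > "2024-01-01"]
    return len(joined), len(kept)

def filter_then_join():
    # push both filters below the join (the optimized plan)
    o_f = [o for o in orders if o["order_date"] > "2024-01-01"]
    c_f = [c for c in customers if c["country"] == "USA"]
    joined = [(o, c) for o in o_f for c in c_f if o["customer_id"] == c["id"]]
    return len(joined), len(joined)

print(join_then_filter())   # (1000, 50) -- joined 1000 rows to keep 50
print(filter_then_join())   # (50, 50)   -- joined only the 50 surviving rows
```

Both strategies return exactly the same final rows; pushdown simply joins far fewer of them.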
```sql
-- Unoptimized: Filter after join
SELECT o.order_id, c.name
FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE c.country = 'USA'
  AND o.order_date > '2024-01-01';

-- Conceptual plan:
-- 1. Scan all orders (10M rows)
-- 2. Scan all customers (1M rows)
-- 3. Join (produces 10M matches)
-- 4. Filter (reduces to 100K rows)

-- Joined 10M rows just to keep 100K!
```

```sql
-- Optimized: Filter before join
SELECT o.order_id, c.name
FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE c.country = 'USA'            -- Filter c first
  AND o.order_date > '2024-01-01';

-- Conceptual plan (optimized):
-- 1. Scan orders, filter date (500K rows)
-- 2. Scan customers, filter country (50K rows)
-- 3. Join (produces 100K matches)
-- 4. No additional filtering needed

-- Joined 10-20x fewer rows!
```

Just as selection reduces rows, projection reduces columns. Pushing projection early eliminates unused columns before they consume memory or bandwidth.
Consider a table with 50 columns, where the query only needs 3. Without projection pushdown, every operation handles all 50 columns. With pushdown, operations handle only the 3 required columns—potentially 16× less memory per row.
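To make that arithmetic concrete, here is a minimal Python sketch (column names and widths are invented) of narrowing a row to the needed columns before it flows onward:

```python
# Toy model of projection pushdown: column names and widths are invented.
wide_order = {f"col{i}": 0 for i in range(28)}
wide_order.update(customer_id=7, total_amount=99.5)    # 30 columns total

needed = ("customer_id", "total_amount")
projected = {k: wide_order[k] for k in needed}         # narrow early

full_width = len(wide_order)     # columns carried without pushdown
narrow_width = len(projected)    # columns carried with pushdown
print(full_width, narrow_width)  # 30 2
```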
```sql
-- Query requesting only specific columns
SELECT c.name, SUM(o.total_amount)
FROM orders o                              -- orders has 30 columns
JOIN customers c ON o.customer_id = c.id   -- customers has 25 columns
GROUP BY c.name;

-- Without projection pushdown:
-- Join processes all 55 columns (30 + 25)
-- Memory per row: potentially 2KB+
-- 1M joined rows = 2GB memory pressure

-- With projection pushdown:
-- Orders: only fetch customer_id, total_amount (2 columns)
-- Customers: only fetch id, name (2 columns)
-- Memory per row: ~100 bytes
-- 1M joined rows = 100MB memory pressure
-- 20× reduction in memory usage!
```

Smart optimizers go beyond pushing down explicit predicates; they also derive new predicates from join conditions and transitivity:
Transitivity Derivation: If A.x = B.x and B.x = 10, then A.x = 10. This derived predicate can be pushed to table A.
Range Propagation: If A.date >= B.date and A.date <= '2024-06-30', then B.date <= '2024-06-30'.
```sql
-- Original query with join-based filter
SELECT *
FROM orders o
JOIN order_items oi ON o.id = oi.order_id
WHERE oi.product_id = 42;

-- Derived predicate: Since o.id = oi.order_id,
-- and we're only keeping oi rows where product_id = 42,
-- we can potentially push predicates to orders if we know
-- which orders contain product 42.

-- More powerful example:
SELECT *
FROM regions r
JOIN countries c ON r.region_id = c.region_id
JOIN states s ON c.country_id = s.country_id
WHERE r.name = 'North America';

-- Optimizer derives:
-- c.region_id must match regions where name='North America'
-- This derived constraint can filter countries early
-- before joining with states
```

Not all predicates can be pushed down. Outer joins block pushdown on the nullable side. Aggregations block pushdown of predicates that reference aggregate results. Volatile functions (like RANDOM()) cannot be pushed because they must be evaluated exactly once at their original position.
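Transitivity derivation can be sketched with a small union-find over equality predicates. This is a simplified model, not any particular engine's implementation; the column names mirror the examples above:

```python
# Simplified model of transitive predicate derivation using union-find:
# columns linked by equi-join predicates form equivalence classes, and any
# constant filter on one member can be copied to every member.
parent = {}

def find(x):
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]    # path halving
        x = parent[x]
    return x

def union(a, b):
    parent[find(a)] = find(b)

union("A.x", "B.x")    # join predicate A.x = B.x
union("B.x", "C.x")    # join predicate B.x = C.x

constants = {"B.x": 10}    # explicit filter: B.x = 10
derived = {col: val for col_c, val in constants.items()
           for col in list(parent) if find(col) == find(col_c)}
print(derived)    # {'A.x': 10, 'B.x': 10, 'C.x': 10}
```

The derived `A.x = 10` and `C.x = 10` can now be pushed to their respective table scans.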
When queries involve multiple joins, the order in which tables are joined can affect performance by orders of magnitude. While cost-based optimizers use statistics to determine optimal join order, heuristic approaches use simpler rules that work well in practice.
To understand why join order matters so much, consider a query joining four tables: A (1,000 rows), B (10,000 rows), C (100 rows), and D (500 rows).
| Join Order | Intermediate Sizes (estimated) | Total Work |
|---|---|---|
| ((A ⋈ B) ⋈ C) ⋈ D | 10K → 1K → 500 | ~11,500 |
| ((A ⋈ C) ⋈ B) ⋈ D | 100 → 1K → 500 | ~1,600 |
| ((C ⋈ D) ⋈ A) ⋈ B | 50 → 50 → 500 | ~600 |
The difference is 20× without any cost model—you just need to know table sizes!
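A sketch of how such numbers arise: the script below scores left-deep join orders by summing intermediate result sizes. The cardinalities are from the table above; the uniform per-join selectivity of 0.001 is an assumption chosen so the arithmetic reproduces its numbers:

```python
# Score left-deep join orders by total intermediate rows. Cardinalities come
# from the table above; the uniform 0.001 per-join selectivity is an assumption.
from itertools import permutations

CARD = {"A": 1_000, "B": 10_000, "C": 100, "D": 500}
SELECTIVITY = 0.001    # assumed fraction of row pairs surviving each join

def total_intermediate(order):
    rows = CARD[order[0]]
    total = 0
    for table in order[1:]:
        rows = rows * CARD[table] * SELECTIVITY    # estimated join output
        total += rows                              # work done at this step
    return total

print(total_intermediate(("A", "B", "C", "D")))    # 11500.0
print(total_intermediate(("A", "C", "B", "D")))    # 1600.0
print(total_intermediate(("C", "D", "A", "B")))    # 600.0

# Exhaustive search over all 24 orders finds the cheapest:
best = min(permutations(CARD), key=total_intermediate)
print(best)    # ('C', 'D', 'A', 'B')
```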
The most fundamental join ordering heuristic is to start with the smallest tables (after applying filters). This minimizes the size of intermediate results.
```sql
-- Query joining multiple tables
SELECT *
FROM products p          -- 100K rows
JOIN categories c        -- 50 rows (small!)
  ON p.category_id = c.id
JOIN suppliers s         -- 1K rows
  ON p.supplier_id = s.id
JOIN inventory i         -- 500K rows
  ON p.id = i.product_id;

-- Heuristic join order:
-- 1. Start with categories (50 rows) - smallest
-- 2. Join products (filtered by category: ~2K rows)
-- 3. Join suppliers (further filtered: ~2K rows)
-- 4. Join inventory (final: ~10K rows)

-- Alternative (poor) order:
-- 1. Start with inventory (500K rows)
-- 2. Join products (still 500K rows!)
-- 3. Join suppliers (still ~500K!)
-- 4. Join categories (finally filters down)

-- First approach: ~15K intermediate rows total
-- Second approach: ~1.5M intermediate rows total
-- 100× difference!
```

Beyond table size, join selectivity matters. A join that matches few rows per left tuple reduces intermediate size more than a join that matches many rows.
Another key heuristic: don't create Cartesian products. A Cartesian product between tables A (1K rows) and B (1K rows) produces 1M row pairs—almost always disastrous.
When ordering joins, always ensure each new table is connected to a table already in the partial result via a join condition. This prevents accidental Cartesian products.
```sql
-- Query with chain join pattern: A -> B -> C
SELECT *
FROM customers c
JOIN orders o ON c.id = o.customer_id
JOIN order_items oi ON o.id = oi.order_id;

-- Valid join orders (maintain connectivity):
-- c → o → oi (customers first, then orders, then items)
-- oi → o → c (items first, then orders, then customers)

-- INVALID order (causes Cartesian product):
-- c → oi (no direct join condition!) → o
-- This would create c × oi Cartesian product
-- before filtering by the join to orders

-- For 1000 customers, 10K orders, 100K items:
-- Valid: 1000 → 10K → 100K (progressive)
-- Invalid: 1000 × 100K = 100M rows (!!)
```

Modern optimizers build a 'join graph' where tables are nodes and join predicates are edges. Valid join orders correspond to connected subgraphs. Some advanced techniques (like Cartesian product injection) deliberately add cross-joins when other heuristics suggest it might help, but this requires cost-based evaluation.
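The connectivity rule is straightforward to encode. A minimal sketch, assuming the customers/orders/order_items join graph above:

```python
# Sketch: validate a join order against a join graph so no step degenerates
# into a Cartesian product. Table names follow the example above.
JOIN_GRAPH = {
    frozenset({"customers", "orders"}),        # c.id = o.customer_id
    frozenset({"orders", "order_items"}),      # o.id = oi.order_id
}

def is_connected_order(order):
    joined = {order[0]}
    for table in order[1:]:
        # the next table must share a join predicate with the partial result
        if not any(frozenset({table, j}) in JOIN_GRAPH for j in joined):
            return False
        joined.add(table)
    return True

print(is_connected_order(["customers", "orders", "order_items"]))    # True
print(is_connected_order(["customers", "order_items", "orders"]))    # False
```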
Subqueries, while powerful for expressing complex conditions, can create significant optimization barriers. Correlated subqueries are particularly problematic—they execute the subquery for each row of the outer query, often causing O(n²) behavior.
Heuristic optimization includes a rich set of subquery transformations.
The most important subquery transformation converts subqueries into joins, allowing the optimizer to reason about the entire query holistically.
```sql
-- Correlated subquery (executes per row)
SELECT c.name, c.email
FROM customers c
WHERE EXISTS (
  SELECT 1 FROM orders o
  WHERE o.customer_id = c.id
    AND o.total > 1000
);

-- Execution model:
-- For EACH customer (1M iterations):
--   Run subquery on orders
--   Check if any order > 1000
--
-- Total: ~1M subquery executions!
```

```sql
-- Flattened to semi-join (conceptual syntax)
SELECT DISTINCT c.name, c.email
FROM customers c
SEMI JOIN orders o
  ON o.customer_id = c.id
  AND o.total > 1000;

-- Or using explicit JOIN + DISTINCT:
SELECT DISTINCT c.name, c.email
FROM customers c
JOIN orders o ON o.customer_id = c.id
WHERE o.total > 1000;

-- Execution: Single join operation
-- Uses hash join or index join
-- Much faster for large tables!
```

IN subqueries are converted to semi-joins, while NOT IN subqueries become anti-joins. These transformations are particularly impactful:
```sql
-- IN subquery (common pattern)
SELECT product_name
FROM products
WHERE category_id IN (
  SELECT id FROM categories
  WHERE department = 'Electronics'
);

-- Converted to semi-join:
SELECT p.product_name
FROM products p
SEMI JOIN categories c ON p.category_id = c.id
WHERE c.department = 'Electronics';

-- Optimizer can now:
-- 1. Push 'department = Electronics' to categories scan
-- 2. Build hash table on filtered categories
-- 3. Probe with products
-- 4. Uses hash semi-join: O(m + n) instead of O(m × n)

-- NOT IN requires anti-join:
SELECT product_name
FROM products
WHERE category_id NOT IN (
  SELECT id FROM categories
  WHERE discontinued = true
);

-- Becomes:
SELECT p.product_name
FROM products p
ANTI JOIN categories c
  ON p.category_id = c.id
  AND c.discontinued = true;
```

Scalar subqueries (returning a single value) in the SELECT clause can often be cached or converted to outer joins:
```sql
-- Scalar subquery in SELECT (potentially expensive)
SELECT
  o.order_id,
  o.total,
  (SELECT name FROM customers WHERE id = o.customer_id) AS customer_name
FROM orders o;

-- Converted to LEFT JOIN:
SELECT
  o.order_id,
  o.total,
  c.name AS customer_name
FROM orders o
LEFT JOIN customers c ON c.id = o.customer_id;

-- Benefits:
-- 1. Single join operation vs. per-row subquery
-- 2. Can use hash or index join
-- 3. Optimizer can reorder with other joins
-- 4. Predicates on c.name can be pushed down
```

Not all subqueries can be safely flattened. Subqueries with LIMIT, DISTINCT, GROUP BY, or aggregate functions may change semantics if flattened naively. Additionally, NOT IN with nullable columns has tricky NULL semantics that complicate anti-join conversion. Modern optimizers include sophisticated analysis to detect these cases.
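To see why flattening pays off, the two execution models can be contrasted in plain Python (synthetic data; the hash set stands in for the semi-join's hash table):

```python
# Synthetic contrast of the two execution models: a naive correlated EXISTS
# (re-scan orders per customer) versus a hash semi-join (build once, probe once).
customers = [{"id": i, "name": f"c{i}"} for i in range(200)]
orders = [{"customer_id": i % 50, "total": 2000 if i % 2 else 10}
          for i in range(2000)]

def correlated(cs, os):
    # O(len(cs) * len(os)): the subquery re-runs for every customer
    return [c["name"] for c in cs
            if any(o["customer_id"] == c["id"] and o["total"] > 1000 for o in os)]

def semi_join(cs, os):
    # O(len(cs) + len(os)): build the hash table once, probe once per customer
    big_spenders = {o["customer_id"] for o in os if o["total"] > 1000}
    return [c["name"] for c in cs if c["id"] in big_spenders]

assert correlated(customers, orders) == semi_join(customers, orders)
print(len(semi_join(customers, orders)))    # 25
```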
Beyond restructuring query plans, optimizers apply numerous expression-level transformations. These micro-optimizations accumulate to produce significant performance improvements.
Boolean expressions are simplified using standard logical identities:
| Pattern | Simplifies To | Rule Name |
|---|---|---|
| A AND TRUE | A | Identity |
| A OR FALSE | A | Identity |
| A AND FALSE | FALSE | Annihilation |
| A OR TRUE | TRUE | Annihilation |
| A AND A | A | Idempotence |
| A OR A | A | Idempotence |
| A AND NOT A | FALSE | Contradiction |
| A OR NOT A | TRUE | Tautology |
| NOT (NOT A) | A | Double Negation |
| A AND (A OR B) | A | Absorption |
| NOT (A AND B) | NOT A OR NOT B | De Morgan's |
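These identities are mechanical enough to implement in a few lines. A minimal sketch over tuple-encoded expression trees (a simplified model covering only some of the rules above):

```python
# Minimal rule-based simplifier over tuple-encoded boolean trees.
# ("and", x, y), ("or", x, y), ("not", x); True/False are constants,
# strings stand for column predicates.
def simplify(expr):
    if not isinstance(expr, tuple):
        return expr
    op, *args = expr
    args = [simplify(a) for a in args]          # simplify bottom-up
    if op == "and":
        a, b = args
        if a is True:  return b                     # identity
        if b is True:  return a
        if a is False or b is False: return False   # annihilation
        if a == b:     return a                     # idempotence
    if op == "or":
        a, b = args
        if a is False: return b                     # identity
        if b is False: return a
        if a is True or b is True: return True      # annihilation
        if a == b:     return a                     # idempotence
    if op == "not" and isinstance(args[0], tuple) and args[0][0] == "not":
        return args[0][1]                           # double negation
    return (op, *args)

print(simplify(("and", "A", True)))                  # A
print(simplify(("or", ("and", "A", False), "B")))    # B
print(simplify(("not", ("not", "A"))))               # A
```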
Comparison predicates have their own optimization rules:
```sql
-- Range overlap detection
-- Before: Two overlapping range conditions
WHERE price >= 50
  AND price >= 30
  AND price <= 100
  AND price < 90

-- After: Simplified to tightest bounds
WHERE price >= 50 AND price < 90

-- Impossible range detection
-- Before: Contradictory conditions
WHERE status = 'active' AND status = 'inactive'
-- After: Replaced with FALSE (query returns empty)

-- IN list deduplication
-- Before: Duplicates in IN list
WHERE category IN ('A', 'B', 'A', 'C', 'B')
-- After: Unique values
WHERE category IN ('A', 'B', 'C')

-- Single-value IN conversion
-- Before: IN with single element
WHERE id IN (42)
-- After: Converted to equality (enables index seek)
WHERE id = 42

-- BETWEEN normalization
-- Before: Using BETWEEN
WHERE date BETWEEN '2024-01-01' AND '2024-12-31'
-- After (internally): Converted to range
WHERE date >= '2024-01-01' AND date <= '2024-12-31'
```

Arithmetic expressions undergo several transformations:
- Constant folding: price * 1.1 * 2 → price * 2.2 (compute at optimization time)
- Identity elimination: amount + 0 → amount, quantity * 1 → quantity
- Zero propagation: anything * 0 → 0 (when anything is NOT NULL)
- Strength reduction: price * 2 → price + price (addition is faster than multiplication on some systems)
- Common subexpression elimination: (a+b)*c + (a+b)*d → let temp = a+b in temp*c + temp*d

One of the most important expression transformations rewrites predicates to enable index usage:
```sql
-- Problem: Function on column prevents index use
-- If we have an index on order_date:
SELECT * FROM orders WHERE YEAR(order_date) = 2024;
-- Cannot use index! Must scan all rows and evaluate YEAR()

-- Solution: Rewrite as range predicate
SELECT * FROM orders
WHERE order_date >= '2024-01-01'
  AND order_date < '2025-01-01';
-- Now uses index range scan!

-- Similarly for string operations:
-- Cannot use index:
SELECT * FROM products WHERE UPPER(name) = 'WIDGET';

-- Can use index (if case-insensitive collation):
SELECT * FROM products WHERE name = 'widget';

-- Or create functional index:
CREATE INDEX idx_upper_name ON products(UPPER(name));
-- Now original query can use index

-- Numeric transformations:
-- Cannot use index:
SELECT * FROM accounts WHERE balance + 100 > 500;

-- Rewrite to isolate column:
SELECT * FROM accounts WHERE balance > 400;
-- Now uses index on balance!
```

Predicates that can use indexes are called 'SARGable' (Search ARGument able). A key optimization goal is transforming non-SARGable predicates into SARGable forms. As a developer, writing SARGable predicates directly saves the optimizer work and ensures index usage.
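The YEAR-to-range rewrite above can be sketched as a tiny function; the ISO string date representation is an assumption for illustration:

```python
# Sketch of the YEAR-to-range rewrite; ISO string dates are an assumption.
def year_equals_to_range(column, year):
    # YEAR(column) = year  ==>  half-open range that an index can seek into
    return (f"{column} >= '{year}-01-01'", f"{column} < '{year + 1}-01-01'")

print(year_equals_to_range("order_date", 2024))
# ("order_date >= '2024-01-01'", "order_date < '2025-01-01'")
```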
Aggregate operations (GROUP BY, DISTINCT, COUNT, SUM, etc.) are often the most expensive parts of analytical queries. Heuristic optimizations target aggregate efficiency:
Partial aggregates can sometimes be computed before joins, dramatically reducing join input sizes:
```sql
-- Before: Aggregate after join
SELECT c.region, SUM(o.total)
FROM orders o
JOIN customers c ON o.customer_id = c.id
GROUP BY c.region;

-- Execution:
-- 1. Join orders (10M) with customers (1M)
-- 2. Produces 10M joined rows
-- 3. Group 10M rows into 10 regions

-- Memory: 10M rows during aggregation
```

```sql
-- After: Partial aggregate before join
SELECT c.region, SUM(partial.customer_total)
FROM customers c
JOIN (
  SELECT customer_id, SUM(total) AS customer_total
  FROM orders
  GROUP BY customer_id
) partial ON partial.customer_id = c.id
GROUP BY c.region;

-- Execution:
-- 1. Aggregate orders by customer (10M → 500K)
-- 2. Join 500K with customers
-- 3. Group 500K rows into 10 regions

-- Memory: 20× reduction!
```

Some aggregate functions can be transformed into more efficient forms:
```sql
-- COUNT(*) optimization
-- Naive plan: scan the entire table
SELECT COUNT(*) FROM large_table;
-- Optimized: Can use smallest index or table metadata
-- Some systems store row counts in metadata

-- COUNT(column) vs COUNT(*)
SELECT COUNT(nullable_column) FROM some_table;
-- Must actually check for NULLs, cannot use metadata

-- AVG decomposition for parallel aggregation
-- AVG(x) = SUM(x) / COUNT(x)
-- Can compute SUM and COUNT in parallel, combine at end

-- DISTINCT aggregates
SELECT COUNT(DISTINCT customer_id) FROM orders;
-- Optimizer may use:
-- 1. Hash-based distinct + count
-- 2. Index scan to retrieve distinct values
-- 3. Approximate distinct counting (e.g., HyperLogLog) when allowed

-- MIN/MAX with index
SELECT MAX(order_date) FROM orders;
-- If index exists on order_date:
-- Single index lookup for rightmost entry
-- O(log n) instead of O(n)!
```

Sometimes GROUP BY is unnecessary and can be eliminated:
- GROUP BY primary_key_column produces one group per row anyway, so the GROUP BY is eliminated.
- SELECT SUM(total) FROM orders has an implicit single group; no grouping operation is needed.
- SELECT x FROM t GROUP BY x LIMIT 1 can be rewritten as SELECT DISTINCT x FROM t LIMIT 1.

Aggregate pushdown is only valid when the aggregate function is decomposable (partial results can be combined into the final result). SUM, COUNT, MIN, and MAX are decomposable; AVG can be decomposed as SUM/COUNT. MEDIAN and percentiles are NOT decomposable and require different optimization strategies.
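The decomposability requirement can be checked empirically. The sketch below (synthetic data, with a dict lookup standing in for the join) verifies that pre-aggregating per customer before the join gives the same per-region totals as joining first, and shows the (SUM, COUNT) decomposition of AVG:

```python
# Synthetic check that SUM is decomposable: pre-aggregating per customer before
# the join yields the same per-region totals as joining first.
from collections import defaultdict

customers = {i: "east" if i % 2 else "west" for i in range(100)}   # id -> region
orders = [(i % 100, float(i)) for i in range(10_000)]              # (customer_id, total)

def join_then_aggregate():
    totals = defaultdict(float)
    for cust_id, amount in orders:                  # all 10,000 rows hit the join
        totals[customers[cust_id]] += amount
    return dict(totals)

def aggregate_then_join():
    per_customer = defaultdict(float)
    for cust_id, amount in orders:                  # 10,000 rows -> 100 partials
        per_customer[cust_id] += amount
    totals = defaultdict(float)
    for cust_id, subtotal in per_customer.items():  # only 100 rows reach the join
        totals[customers[cust_id]] += subtotal
    return dict(totals)

assert join_then_aggregate() == aggregate_then_join()

# AVG decomposes into (SUM, COUNT) partials that merge exactly:
partials = [(sum(p), len(p)) for p in ([1, 2, 3], [4, 5])]
avg = sum(s for s, _ in partials) / sum(c for _, c in partials)
print(avg)    # 3.0
```

No analogous merge exists for MEDIAN: the medians of the partitions do not determine the median of the whole dataset.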
We've surveyed the major categories of heuristic optimizations that form the backbone of query optimization. These patterns, refined over decades of database development, encode practical wisdom about efficient query execution.
What's Next:
We've covered common heuristics broadly. The next pages dive deep into two of the most impactful: Selection Before Join (predicate pushdown) and Projection Early (column elimination). These specific techniques warrant detailed exploration given their critical role in query optimization.
You now have a comprehensive map of heuristic query optimization techniques. These patterns provide the foundation for understanding both rule-based optimization and the logical optimization phase of cost-based optimizers.