Every query optimizer maintains a playbook—a collection of battle-tested strategies accumulated over decades of database system development. These strategies, known as heuristics, are transformations that improve query performance in the vast majority of cases without requiring detailed knowledge about data distributions.
Unlike cost-based decisions that weigh alternatives using statistics, heuristics encode universal wisdom: reduce data volume early, minimize expensive operations, and structure computations to exploit available optimizations. These principles transcend any particular database system or workload.
By the end of this page, you will understand the major categories of heuristic optimizations, the specific transformations within each category, why these heuristics work in practice, and how to recognize when they can dramatically improve query performance. You'll develop intuition for the core principles that guide query optimization.
Not all heuristics are equal in their impact or applicability. They form a hierarchy based on how universally beneficial they are and how large their typical performance improvements are.
Heuristics can be organized into tiers by typical impact: near-universal wins such as selection and projection pushdown sit at the top and are applied unconditionally, while more situational rewrites pay off only under the right conditions and often need cost-based confirmation.
Underlying all specific transformation rules are a few meta-heuristics—abstract principles that guide optimizer design:
| Meta-Heuristic | Principle | Manifestations |
|---|---|---|
| Reduce Early | Minimize data volume as early as possible in the plan | Selection pushdown, projection pushdown, early aggregation |
| Avoid Redundancy | Never compute the same result twice | Common subexpression elimination, view caching, memoization |
| Prefer Pipeline | Keep data flowing rather than materializing intermediate results | Avoid sorts when possible, prefer streaming joins |
| Exploit Order | Use existing sort order instead of creating new ones | Index scan exploitation, merge join when data is pre-sorted |
| Minimize I/O | Disk access dominates cost; design plans to minimize it | Index-only scans, covering indexes, sequential access patterns |
When analyzing a query plan or designing optimizations, always ask: 'Am I reducing early? Avoiding redundancy? Exploiting available order?' These questions reveal optimization opportunities that specific rules might miss.
The most impactful heuristics focus on reducing the amount of data that flows through the query plan. Less data means less memory pressure, fewer disk I/Os, and faster execution at every subsequent stage.
The principle is simple: if you're going to filter out rows, do it as early as possible. Every row filtered early is a row that doesn't need to be joined, grouped, or sorted.
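The effect is easy to demonstrate outside a database. Below is a toy Python sketch (synthetic tables and invented row counts, not any engine's implementation) that counts how many rows reach the join under each strategy:

```python
# Toy model of selection pushdown: synthetic tables, invented row counts.
orders = [{"order_id": i, "customer_id": i % 100,
           "order_date": "2023-06-01" if i % 2 else "2024-03-01"}
          for i in range(1000)]
customers = [{"id": i, "country": "USA" if i < 10 else "DE"} for i in range(100)]

def join_then_filter():
    # join everything, then filter (the unoptimized plan)
    joined = [(o, c) for o in orders for c in customers
              if o["customer_id"] == c["id"]]
    kept = [(o, c) for o, c in joined
            if c["country"] == "USA" and o["order_date"] > "2024-01-01"]
    return len(joined), len(kept)

def filter_then_join():
    # push both filters below the join (the optimized plan)
    o_f = [o for o in orders if o["order_date"] > "2024-01-01"]
    c_f = [c for c in customers if c["country"] == "USA"]
    joined = [(o, c) for o in o_f for c in c_f if o["customer_id"] == c["id"]]
    return len(joined), len(joined)

print(join_then_filter())   # (1000, 50) -- joined 1000 rows to keep 50
print(filter_then_join())   # (50, 50)   -- joined only the 50 surviving rows
```

Both strategies return exactly the same final rows; pushdown simply joins far fewer of them.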
```sql
-- Unoptimized: Filter after join
SELECT o.order_id, c.name
FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE c.country = 'USA'
  AND o.order_date > '2024-01-01';

-- Conceptual plan:
-- 1. Scan all orders (10M rows)
-- 2. Scan all customers (1M rows)
-- 3. Join (produces 10M matches)
-- 4. Filter (reduces to 100K rows)

-- Joined 10M rows just to keep 100K!
```

```sql
-- Optimized: Filter before join
SELECT o.order_id, c.name
FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE c.country = 'USA'            -- Filter c first
  AND o.order_date > '2024-01-01';

-- Conceptual plan (optimized):
-- 1. Scan orders, filter date (500K rows)
-- 2. Scan customers, filter country (50K rows)
-- 3. Join (produces 100K matches)
-- 4. No additional filtering needed

-- Joined 10-20x fewer rows!
```

Just as selection reduces rows, projection reduces columns. Pushing projection early eliminates unused columns before they consume memory or bandwidth.
Consider a table with 50 columns, where the query only needs 3. Without projection pushdown, every operation handles all 50 columns. With pushdown, operations handle only the 3 required columns—potentially 16× less memory per row.
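To make that arithmetic concrete, here is a minimal Python sketch (column names and widths are invented) of narrowing a row to the needed columns before it flows onward:

```python
# Toy model of projection pushdown: column names and widths are invented.
wide_order = {f"col{i}": 0 for i in range(28)}
wide_order.update(customer_id=7, total_amount=99.5)    # 30 columns total

needed = ("customer_id", "total_amount")
projected = {k: wide_order[k] for k in needed}         # narrow early

full_width = len(wide_order)     # columns carried without pushdown
narrow_width = len(projected)    # columns carried with pushdown
print(full_width, narrow_width)  # 30 2
```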
```sql
-- Query requesting only specific columns
SELECT c.name, SUM(o.total_amount)
FROM orders o                              -- orders has 30 columns
JOIN customers c ON o.customer_id = c.id   -- customers has 25 columns
GROUP BY c.name;

-- Without projection pushdown:
-- Join processes all 55 columns (30 + 25)
-- Memory per row: potentially 2KB+
-- 1M joined rows = 2GB memory pressure

-- With projection pushdown:
-- Orders: only fetch customer_id, total_amount (2 columns)
-- Customers: only fetch id, name (2 columns)
-- Memory per row: ~100 bytes
-- 1M joined rows = 100MB memory pressure
-- 20× reduction in memory usage!
```

Smart optimizers go beyond pushing down explicit predicates; they also derive new predicates from join conditions and transitivity:
Transitivity Derivation: If A.x = B.x and B.x = 10, then A.x = 10. This derived predicate can be pushed to table A.
Range Propagation: If A.date >= B.date and A.date <= '2024-06-30', then B.date <= '2024-06-30'.
```sql
-- Original query with join-based filter
SELECT *
FROM orders o
JOIN order_items oi ON o.id = oi.order_id
WHERE oi.product_id = 42;

-- Derived predicate: Since o.id = oi.order_id,
-- and we're only keeping oi rows where product_id = 42,
-- we can potentially push predicates to orders if we know
-- which orders contain product 42.

-- More powerful example:
SELECT *
FROM regions r
JOIN countries c ON r.region_id = c.region_id
JOIN states s ON c.country_id = s.country_id
WHERE r.name = 'North America';

-- Optimizer derives:
-- c.region_id must match regions where name='North America'
-- This derived constraint can filter countries early
-- before joining with states
```

Not all predicates can be pushed down. Outer joins block pushdown on the nullable side. Aggregations block pushdown of predicates that reference aggregate results. Volatile functions (like RANDOM()) cannot be pushed because they must be evaluated exactly once at their original position.
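Transitivity derivation can be sketched with a small union-find over equality predicates. This is a simplified model, not any particular engine's implementation; the column names mirror the examples above:

```python
# Simplified model of transitive predicate derivation using union-find:
# columns linked by equi-join predicates form equivalence classes, and any
# constant filter on one member can be copied to every member.
parent = {}

def find(x):
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]    # path halving
        x = parent[x]
    return x

def union(a, b):
    parent[find(a)] = find(b)

union("A.x", "B.x")    # join predicate A.x = B.x
union("B.x", "C.x")    # join predicate B.x = C.x

constants = {"B.x": 10}    # explicit filter: B.x = 10
derived = {col: val for col_c, val in constants.items()
           for col in list(parent) if find(col) == find(col_c)}
print(derived)    # {'A.x': 10, 'B.x': 10, 'C.x': 10}
```

The derived `A.x = 10` and `C.x = 10` can now be pushed to their respective table scans.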
When queries involve multiple joins, the order in which tables are joined can affect performance by orders of magnitude. While cost-based optimizers use statistics to determine optimal join order, heuristic approaches use simpler rules that work well in practice.
To understand why join order matters so much, consider a query joining four tables: A (1,000 rows), B (10,000 rows), C (100 rows), and D (500 rows).
| Join Order | Intermediate Sizes (estimated) | Total Work |
|---|---|---|
| ((A ⋈ B) ⋈ C) ⋈ D | 10K → 1K → 500 | ~11,500 |
| ((A ⋈ C) ⋈ B) ⋈ D | 100 → 1K → 500 | ~1,600 |
| ((C ⋈ D) ⋈ A) ⋈ B | 50 → 50 → 500 | ~600 |
The difference is 20× without any cost model—you just need to know table sizes!
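A sketch of how such numbers arise: the script below scores left-deep join orders by summing intermediate result sizes. The cardinalities are from the table above; the uniform per-join selectivity of 0.001 is an assumption chosen so the arithmetic reproduces its numbers:

```python
# Score left-deep join orders by total intermediate rows. Cardinalities come
# from the table above; the uniform 0.001 per-join selectivity is an assumption.
from itertools import permutations

CARD = {"A": 1_000, "B": 10_000, "C": 100, "D": 500}
SELECTIVITY = 0.001    # assumed fraction of row pairs surviving each join

def total_intermediate(order):
    rows = CARD[order[0]]
    total = 0
    for table in order[1:]:
        rows = rows * CARD[table] * SELECTIVITY    # estimated join output
        total += rows                              # work done at this step
    return total

print(total_intermediate(("A", "B", "C", "D")))    # 11500.0
print(total_intermediate(("A", "C", "B", "D")))    # 1600.0
print(total_intermediate(("C", "D", "A", "B")))    # 600.0

# Exhaustive search over all 24 orders finds the cheapest:
best = min(permutations(CARD), key=total_intermediate)
print(best)    # ('C', 'D', 'A', 'B')
```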
The most fundamental join ordering heuristic is to start with the smallest tables (after applying filters). This minimizes the size of intermediate results.
```sql
-- Query joining multiple tables
SELECT *
FROM products p          -- 100K rows
JOIN categories c        -- 50 rows (small!)
  ON p.category_id = c.id
JOIN suppliers s         -- 1K rows
  ON p.supplier_id = s.id
JOIN inventory i         -- 500K rows
  ON p.id = i.product_id;

-- Heuristic join order:
-- 1. Start with categories (50 rows) - smallest
-- 2. Join products (filtered by category: ~2K rows)
-- 3. Join suppliers (further filtered: ~2K rows)
-- 4. Join inventory (final: ~10K rows)

-- Alternative (poor) order:
-- 1. Start with inventory (500K rows)
-- 2. Join products (still 500K rows!)
-- 3. Join suppliers (still ~500K!)
-- 4. Join categories (finally filters down)

-- First approach: ~15K intermediate rows total
-- Second approach: ~1.5M intermediate rows total
-- 100× difference!
```

Beyond table size, join selectivity matters. A join that matches few rows per left tuple reduces intermediate size more than a join that matches many rows.
Another key heuristic: don't create Cartesian products. A Cartesian product between tables A (1K rows) and B (1K rows) produces 1M row pairs—almost always disastrous.
When ordering joins, always ensure each new table is connected to a table already in the partial result via a join condition. This prevents accidental Cartesian products.
```sql
-- Query with chain join pattern: A -> B -> C
SELECT *
FROM customers c
JOIN orders o ON c.id = o.customer_id
JOIN order_items oi ON o.id = oi.order_id;

-- Valid join orders (maintain connectivity):
-- c → o → oi (customers first, then orders, then items)
-- oi → o → c (items first, then orders, then customers)

-- INVALID order (causes Cartesian product):
-- c → oi (no direct join condition!) → o
-- This would create c × oi Cartesian product
-- before filtering by the join to orders

-- For 1000 customers, 10K orders, 100K items:
-- Valid: 1000 → 10K → 100K (progressive)
-- Invalid: 1000 × 100K = 100M rows (!!)
```

Modern optimizers build a 'join graph' where tables are nodes and join predicates are edges. Valid join orders correspond to connected subgraphs. Some advanced techniques (like Cartesian product injection) deliberately add cross-joins when other heuristics suggest it might help, but this requires cost-based evaluation.
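The connectivity rule is straightforward to encode. A minimal sketch, assuming the customers/orders/order_items join graph above:

```python
# Sketch: validate a join order against a join graph so no step degenerates
# into a Cartesian product. Table names follow the example above.
JOIN_GRAPH = {
    frozenset({"customers", "orders"}),        # c.id = o.customer_id
    frozenset({"orders", "order_items"}),      # o.id = oi.order_id
}

def is_connected_order(order):
    joined = {order[0]}
    for table in order[1:]:
        # the next table must share a join predicate with the partial result
        if not any(frozenset({table, j}) in JOIN_GRAPH for j in joined):
            return False
        joined.add(table)
    return True

print(is_connected_order(["customers", "orders", "order_items"]))    # True
print(is_connected_order(["customers", "order_items", "orders"]))    # False
```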
Subqueries, while powerful for expressing complex conditions, can create significant optimization barriers. Correlated subqueries are particularly problematic—they execute the subquery for each row of the outer query, often causing O(n²) behavior.
Heuristic optimization includes a rich set of subquery transformations.
The most important subquery transformation converts subqueries into joins, allowing the optimizer to reason about the entire query holistically.
```sql
-- Correlated subquery (executes per row)
SELECT c.name, c.email
FROM customers c
WHERE EXISTS (
  SELECT 1 FROM orders o
  WHERE o.customer_id = c.id
    AND o.total > 1000
);

-- Execution model:
-- For EACH customer (1M iterations):
--   Run subquery on orders
--   Check if any order > 1000
--
-- Total: ~1M subquery executions!
```

```sql
-- Flattened to semi-join (conceptual syntax)
SELECT DISTINCT c.name, c.email
FROM customers c
SEMI JOIN orders o
  ON o.customer_id = c.id
  AND o.total > 1000;

-- Or using explicit JOIN + DISTINCT:
SELECT DISTINCT c.name, c.email
FROM customers c
JOIN orders o ON o.customer_id = c.id
WHERE o.total > 1000;

-- Execution: Single join operation
-- Uses hash join or index join
-- Much faster for large tables!
```

IN subqueries are converted to semi-joins, while NOT IN subqueries become anti-joins. These transformations are particularly impactful:
```sql
-- IN subquery (common pattern)
SELECT product_name
FROM products
WHERE category_id IN (
  SELECT id FROM categories
  WHERE department = 'Electronics'
);

-- Converted to semi-join:
SELECT p.product_name
FROM products p
SEMI JOIN categories c ON p.category_id = c.id
WHERE c.department = 'Electronics';

-- Optimizer can now:
-- 1. Push 'department = Electronics' to categories scan
-- 2. Build hash table on filtered categories
-- 3. Probe with products
-- 4. Uses hash semi-join: O(m + n) instead of O(m × n)

-- NOT IN requires anti-join:
SELECT product_name
FROM products
WHERE category_id NOT IN (
  SELECT id FROM categories
  WHERE discontinued = true
);

-- Becomes:
SELECT p.product_name
FROM products p
ANTI JOIN categories c
  ON p.category_id = c.id
  AND c.discontinued = true;
```

Scalar subqueries (returning a single value) in the SELECT clause can often be cached or converted to outer joins:
```sql
-- Scalar subquery in SELECT (potentially expensive)
SELECT
  o.order_id,
  o.total,
  (SELECT name FROM customers WHERE id = o.customer_id) AS customer_name
FROM orders o;

-- Converted to LEFT JOIN:
SELECT
  o.order_id,
  o.total,
  c.name AS customer_name
FROM orders o
LEFT JOIN customers c ON c.id = o.customer_id;

-- Benefits:
-- 1. Single join operation vs. per-row subquery
-- 2. Can use hash or index join
-- 3. Optimizer can reorder with other joins
-- 4. Predicates on c.name can be pushed down
```

Not all subqueries can be safely flattened. Subqueries with LIMIT, DISTINCT, GROUP BY, or aggregate functions may change semantics if flattened naively. Additionally, NOT IN with nullable columns has tricky NULL semantics that complicate anti-join conversion. Modern optimizers include sophisticated analysis to detect these cases.
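To see why flattening pays off, the two execution models can be contrasted in plain Python (synthetic data; the hash set stands in for the semi-join's hash table):

```python
# Synthetic contrast of the two execution models: a naive correlated EXISTS
# (re-scan orders per customer) versus a hash semi-join (build once, probe once).
customers = [{"id": i, "name": f"c{i}"} for i in range(200)]
orders = [{"customer_id": i % 50, "total": 2000 if i % 2 else 10}
          for i in range(2000)]

def correlated(cs, os):
    # O(len(cs) * len(os)): the subquery re-runs for every customer
    return [c["name"] for c in cs
            if any(o["customer_id"] == c["id"] and o["total"] > 1000 for o in os)]

def semi_join(cs, os):
    # O(len(cs) + len(os)): build the hash table once, probe once per customer
    big_spenders = {o["customer_id"] for o in os if o["total"] > 1000}
    return [c["name"] for c in cs if c["id"] in big_spenders]

assert correlated(customers, orders) == semi_join(customers, orders)
print(len(semi_join(customers, orders)))    # 25
```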
Beyond restructuring query plans, optimizers apply numerous expression-level transformations. These micro-optimizations accumulate to produce significant performance improvements.
Boolean expressions are simplified using standard logical identities:
| Pattern | Simplifies To | Rule Name |
|---|---|---|
| A AND TRUE | A | Identity |
| A OR FALSE | A | Identity |
| A AND FALSE | FALSE | Annihilation |
| A OR TRUE | TRUE | Annihilation |
| A AND A | A | Idempotence |
| A OR A | A | Idempotence |
| A AND NOT A | FALSE | Contradiction |
| A OR NOT A | TRUE | Tautology |
| NOT (NOT A) | A | Double Negation |
| A AND (A OR B) | A | Absorption |
| NOT (A AND B) | NOT A OR NOT B | De Morgan's |
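These identities are mechanical enough to implement in a few lines. A minimal sketch over tuple-encoded expression trees (a simplified model covering only some of the rules above):

```python
# Minimal rule-based simplifier over tuple-encoded boolean trees.
# ("and", x, y), ("or", x, y), ("not", x); True/False are constants,
# strings stand for column predicates.
def simplify(expr):
    if not isinstance(expr, tuple):
        return expr
    op, *args = expr
    args = [simplify(a) for a in args]          # simplify bottom-up
    if op == "and":
        a, b = args
        if a is True:  return b                     # identity
        if b is True:  return a
        if a is False or b is False: return False   # annihilation
        if a == b:     return a                     # idempotence
    if op == "or":
        a, b = args
        if a is False: return b                     # identity
        if b is False: return a
        if a is True or b is True: return True      # annihilation
        if a == b:     return a                     # idempotence
    if op == "not" and isinstance(args[0], tuple) and args[0][0] == "not":
        return args[0][1]                           # double negation
    return (op, *args)

print(simplify(("and", "A", True)))                  # A
print(simplify(("or", ("and", "A", False), "B")))    # B
print(simplify(("not", ("not", "A"))))               # A
```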
Comparison predicates have their own optimization rules:
```sql
-- Range overlap detection
-- Before: Two overlapping range conditions
WHERE price >= 50
  AND price >= 30
  AND price <= 100
  AND price < 90

-- After: Simplified to tightest bounds
WHERE price >= 50 AND price < 90

-- Impossible range detection
-- Before: Contradictory conditions
WHERE status = 'active' AND status = 'inactive'
-- After: Replaced with FALSE (query returns empty)

-- IN list deduplication
-- Before: Duplicates in IN list
WHERE category IN ('A', 'B', 'A', 'C', 'B')
-- After: Unique values
WHERE category IN ('A', 'B', 'C')

-- Single-value IN conversion
-- Before: IN with single element
WHERE id IN (42)
-- After: Converted to equality (enables index seek)
WHERE id = 42

-- BETWEEN normalization
-- Before: Using BETWEEN
WHERE date BETWEEN '2024-01-01' AND '2024-12-31'
-- After (internally): Converted to range
WHERE date >= '2024-01-01' AND date <= '2024-12-31'
```

Arithmetic expressions undergo several transformations:
- Constant folding: price * 1.1 * 2 → price * 2.2 (compute at optimization time)
- Identity elimination: amount + 0 → amount, quantity * 1 → quantity
- Zero propagation: anything * 0 → 0 (when anything is NOT NULL)
- Strength reduction: price * 2 → price + price (addition is faster than multiplication on some systems)
- Common subexpression elimination: (a+b)*c + (a+b)*d → let temp = a+b in temp*c + temp*d

One of the most important expression transformations rewrites predicates to enable index usage:
```sql
-- Problem: Function on column prevents index use
-- If we have an index on order_date:
SELECT * FROM orders WHERE YEAR(order_date) = 2024;
-- Cannot use index! Must scan all rows and evaluate YEAR()

-- Solution: Rewrite as range predicate
SELECT * FROM orders
WHERE order_date >= '2024-01-01'
  AND order_date < '2025-01-01';
-- Now uses index range scan!

-- Similarly for string operations:
-- Cannot use index:
SELECT * FROM products WHERE UPPER(name) = 'WIDGET';

-- Can use index (if case-insensitive collation):
SELECT * FROM products WHERE name = 'widget';

-- Or create functional index:
CREATE INDEX idx_upper_name ON products(UPPER(name));
-- Now original query can use index

-- Numeric transformations:
-- Cannot use index:
SELECT * FROM accounts WHERE balance + 100 > 500;

-- Rewrite to isolate column:
SELECT * FROM accounts WHERE balance > 400;
-- Now uses index on balance!
```

Predicates that can use indexes are called 'SARGable' (Search ARGument able). A key optimization goal is transforming non-SARGable predicates into SARGable forms. As a developer, writing SARGable predicates directly saves the optimizer work and ensures index usage.
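The YEAR-to-range rewrite above can be sketched as a tiny function; the ISO string date representation is an assumption for illustration:

```python
# Sketch of the YEAR-to-range rewrite; ISO string dates are an assumption.
def year_equals_to_range(column, year):
    # YEAR(column) = year  ==>  half-open range that an index can seek into
    return (f"{column} >= '{year}-01-01'", f"{column} < '{year + 1}-01-01'")

print(year_equals_to_range("order_date", 2024))
# ("order_date >= '2024-01-01'", "order_date < '2025-01-01'")
```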
Aggregate operations (GROUP BY, DISTINCT, COUNT, SUM, etc.) are often the most expensive parts of analytical queries. Heuristic optimizations target aggregate efficiency:
Partial aggregates can sometimes be computed before joins, dramatically reducing join input sizes:
```sql
-- Before: Aggregate after join
SELECT c.region, SUM(o.total)
FROM orders o
JOIN customers c ON o.customer_id = c.id
GROUP BY c.region;

-- Execution:
-- 1. Join orders (10M) with customers (1M)
-- 2. Produces 10M joined rows
-- 3. Group 10M rows into 10 regions

-- Memory: 10M rows during aggregation
```

```sql
-- After: Partial aggregate before join
SELECT c.region, SUM(partial.customer_total)
FROM customers c
JOIN (
  SELECT customer_id, SUM(total) AS customer_total
  FROM orders
  GROUP BY customer_id
) partial ON partial.customer_id = c.id
GROUP BY c.region;

-- Execution:
-- 1. Aggregate orders by customer (10M → 500K)
-- 2. Join 500K with customers
-- 3. Group 500K rows into 10 regions

-- Memory: 20× reduction!
```

Some aggregate functions can be transformed into more efficient forms:
```sql
-- COUNT(*) optimization
-- Naive plan: scan the entire table
SELECT COUNT(*) FROM large_table;
-- Optimized: Can use smallest index or table metadata
-- Some systems store row counts in metadata

-- COUNT(column) vs COUNT(*)
SELECT COUNT(nullable_column) FROM some_table;
-- Must actually check for NULLs, cannot use metadata

-- AVG decomposition for parallel aggregation
-- AVG(x) = SUM(x) / COUNT(x)
-- Can compute SUM and COUNT in parallel, combine at end

-- DISTINCT aggregates
SELECT COUNT(DISTINCT customer_id) FROM orders;
-- Optimizer may use:
-- 1. Hash-based distinct + count
-- 2. Index scan to retrieve distinct values
-- 3. Approximate distinct counting (e.g., HyperLogLog) when allowed

-- MIN/MAX with index
SELECT MAX(order_date) FROM orders;
-- If index exists on order_date:
-- Single index lookup for rightmost entry
-- O(log n) instead of O(n)!
```

Sometimes GROUP BY is unnecessary and can be eliminated:
- GROUP BY primary_key_column produces one group per row anyway, so the GROUP BY is eliminated.
- SELECT SUM(total) FROM orders has an implicit single group; no grouping operation is needed.
- SELECT x FROM t GROUP BY x LIMIT 1 can be rewritten as SELECT DISTINCT x FROM t LIMIT 1.

Aggregate pushdown is only valid when the aggregate function is decomposable (partial results can be combined into the final result). SUM, COUNT, MIN, and MAX are decomposable; AVG can be decomposed as SUM/COUNT. MEDIAN and percentiles are NOT decomposable and require different optimization strategies.
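The decomposability requirement can be checked empirically. The sketch below (synthetic data, with a dict lookup standing in for the join) verifies that pre-aggregating per customer before the join gives the same per-region totals as joining first, and shows the (SUM, COUNT) decomposition of AVG:

```python
# Synthetic check that SUM is decomposable: pre-aggregating per customer before
# the join yields the same per-region totals as joining first.
from collections import defaultdict

customers = {i: "east" if i % 2 else "west" for i in range(100)}   # id -> region
orders = [(i % 100, float(i)) for i in range(10_000)]              # (customer_id, total)

def join_then_aggregate():
    totals = defaultdict(float)
    for cust_id, amount in orders:                  # all 10,000 rows hit the join
        totals[customers[cust_id]] += amount
    return dict(totals)

def aggregate_then_join():
    per_customer = defaultdict(float)
    for cust_id, amount in orders:                  # 10,000 rows -> 100 partials
        per_customer[cust_id] += amount
    totals = defaultdict(float)
    for cust_id, subtotal in per_customer.items():  # only 100 rows reach the join
        totals[customers[cust_id]] += subtotal
    return dict(totals)

assert join_then_aggregate() == aggregate_then_join()

# AVG decomposes into (SUM, COUNT) partials that merge exactly:
partials = [(sum(p), len(p)) for p in ([1, 2, 3], [4, 5])]
avg = sum(s for s, _ in partials) / sum(c for _, c in partials)
print(avg)    # 3.0
```

No analogous merge exists for MEDIAN: the medians of the partitions do not determine the median of the whole dataset.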
We've surveyed the major categories of heuristic optimizations that form the backbone of query optimization. These patterns, refined over decades of database development, encode practical wisdom about efficient query execution.
What's Next:
We've covered common heuristics broadly. The next pages dive deep into two of the most impactful: Selection Before Join (predicate pushdown) and Projection Early (column elimination). These specific techniques warrant detailed exploration given their critical role in query optimization.
You now have a comprehensive map of heuristic query optimization techniques. These patterns provide the foundation for understanding both rule-based optimization and the logical optimization phase of cost-based optimizers.