Heuristic optimization is powerful—selection pushdown, early projection, and join ordering rules dramatically improve most queries. But these rules encode general wisdom, not specific knowledge. They work well when their assumptions hold, but fail—sometimes catastrophically—when reality deviates from expectations.
Understanding where heuristics break down is essential for database practitioners. This knowledge helps you recognize when to trust automatic optimization, when to provide hints, and when to restructure queries or schemas to help the optimizer succeed.
By the end of this page, you will understand the fundamental limitations of rule-based optimization, specific scenarios where common heuristics fail, how cost-based optimization addresses these limitations, and practical strategies for working around heuristic limitations.
The core limitation of heuristic optimization is simple: rules don't know your data. Heuristics encode patterns that are usually beneficial, but "usually" is not "always."
Consider the heuristic "push selections down." It assumes filtering reduces data. But what if the filter matches everything?
```sql
-- Table: transactions (100 million rows)
-- Column: is_valid (99.9% are 'Y', 0.1% are 'N')

-- Query 1: Select invalid transactions
SELECT * FROM transactions WHERE is_valid = 'N';
-- Selectivity: 0.1% → 100,000 rows
-- Index on is_valid: EXCELLENT choice
-- Heuristic: Push down, use index ✓

-- Query 2: Select valid transactions
SELECT * FROM transactions WHERE is_valid = 'Y';
-- Selectivity: 99.9% → 99,900,000 rows
-- Index on is_valid: TERRIBLE choice (worse than full scan!)
-- Heuristic: Push down, use index ✗

-- The heuristic "push selection and use index" is applied equally,
-- but one query benefits and one suffers.
-- Without knowing selectivity, the optimizer can't choose correctly.
```

Join ordering heuristics like "start with the smallest table" fail when filtered cardinality differs dramatically from table size:
```sql
-- Scenario A: Heuristic succeeds
-- customers: 100,000 rows
-- orders: 10,000,000 rows
-- Filter: none

-- Heuristic: Start with customers (smaller)
-- Build hash on 100K customers
-- Probe with 10M orders
-- Result: 100K hash table entries

-- This is optimal!
```
```sql
-- Scenario B: Heuristic fails
-- customers: 100,000 rows
-- orders: 10,000,000 rows
-- Filter: orders.date = TODAY (100 rows!)

-- Heuristic: Still starts with customers
-- Build hash on 100K customers
-- Probe with 100 filtered orders
-- Result: 100K hash entries for 100 probes!

-- Better: Start with filtered orders
-- Build hash on 100 orders
-- Probe with customers (semi-join optimization)
-- Result: 100 hash entries, 100K probes
```

The heuristic "start with the smallest table" doesn't account for filters that dramatically change effective cardinality. A table might have 10 million rows but filter down to 100, making it effectively smaller than a "small" table with 100,000 rows.
When heuristics fail, they don't fail gracefully. A wrong join order can be 1000× slower than the right one. A poorly chosen index scan can transform a millisecond query into a minute-long operation. The cost of heuristic failure is often catastrophic, not incremental.
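A first diagnostic step is comparing the optimizer's row estimates against reality. Below is a minimal sketch, assuming PostgreSQL (other systems have analogous tooling) and the hypothetical transactions table from the selectivity example above:

```sql
-- Run the query with EXPLAIN ANALYZE and compare the planner's estimate
-- ("rows=") against the measured result ("actual ... rows=") in the output.
-- Order-of-magnitude gaps between the two are the usual signature of a
-- heuristic or statistics failure.
EXPLAIN ANALYZE
SELECT * FROM transactions WHERE is_valid = 'Y';  -- hypothetical table from above
```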
Heuristics implicitly assume certain data distributions. When real data violates these assumptions, optimization decisions become suboptimal.
Many heuristics assume uniform data distribution—that any value is equally likely:
```sql
-- Table: sales (10 million rows)
-- Column: product_id (1000 distinct products)
-- Uniform assumption: Each product has ~10,000 sales

-- Reality: Pareto distribution (80-20 rule)
-- Top 10 products: 5 million sales (500K each)
-- Bottom 900 products: 1 million sales (~1,100 each)

-- Query for popular product:
SELECT * FROM sales WHERE product_id = 1;
-- Expected (uniform): 10,000 rows
-- Actual: 500,000 rows (50× more!)
-- Wrong plan chosen based on wrong estimate

-- Query for unpopular product:
SELECT * FROM sales WHERE product_id = 999;
-- Expected (uniform): 10,000 rows
-- Actual: 1,100 rows (9× fewer!)
-- Index would be great, but optimizer might not realize

-- Consequence: Same heuristic applied to both queries,
-- but optimal plans are completely different!
```

Heuristics treat predicates independently, but real data often has correlated columns:
```sql
-- Table: employees
-- department: 10 distinct values (10% each under uniformity)
-- job_title: 100 distinct values (1% each under uniformity)

-- Independent assumption:
-- P(department='Engineering') = 10%
-- P(job_title='Software Engineer') = 1%
-- P(both) = 10% × 1% = 0.1%

-- Reality: Columns are correlated!
-- All 'Software Engineer' titles are in 'Engineering'
-- P(department='Engineering' AND job_title='Software Engineer') = 1%
-- That's 10× higher than the independent estimate!

SELECT * FROM employees
WHERE department = 'Engineering'
  AND job_title = 'Software Engineer';

-- Optimizer estimates: 0.1% of table (100 rows from 100K)
-- Actual result: 1% of table (1000 rows)
-- 10× underestimate leads to wrong plan selection
```

Heuristics often ignore or mishandle NULL values, which have special semantics in SQL:
| Scenario | Heuristic Assumption | Reality | Impact |
|---|---|---|---|
| NOT IN with NULLs | NOT IN behaves like anti-join | A NULL in the list makes NOT IN return UNKNOWN (treated as FALSE) | Wrong results or missed optimization |
| Index on nullable column | Index useful for equality | IS NULL may not use index efficiently | Full scan when index expected |
| Aggregate on nullable column | COUNT(col) = COUNT(*) | COUNT(col) excludes NULLs | Wrong cardinality estimates |
| Outer join predicates | Predicate pushdown is safe | Pushdown may eliminate NULL-extended rows | Wrong results |
Cost-based optimizers address distribution issues by maintaining histograms (recording actual value frequencies), correlation statistics (for column pairs), and NULL frequency metadata. This data-awareness enables informed decisions where heuristics must guess.
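As a concrete illustration, here is a minimal sketch of inspecting that metadata, assuming PostgreSQL and the hypothetical sales table from the skew example above; other systems expose similar catalog views (e.g., Oracle's DBA_TAB_COL_STATISTICS).

```sql
-- PostgreSQL exposes per-column statistics through the pg_stats view.
SELECT attname,            -- column name
       null_frac,          -- fraction of NULLs
       n_distinct,         -- estimated number of distinct values
       most_common_vals,   -- most frequent values (captures skew)
       most_common_freqs,  -- their frequencies
       histogram_bounds    -- equi-depth histogram for the remaining values
FROM pg_stats
WHERE tablename = 'sales' AND attname = 'product_id';
```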
Heuristics operate on logical query structure, ignoring physical resources like memory, I/O bandwidth, and CPU capacity. This blindness leads to suboptimal decisions in resource-constrained scenarios.
The heuristic "use hash join for equi-joins" ignores available memory:
```sql
-- Hash join assumption: Build table fits in memory
-- Available memory: 1GB (work_mem or equivalent)

-- Scenario A: Small build side
SELECT * FROM small_table s   -- 100MB build side
JOIN large_table l ON s.id = l.small_id;
-- Hash join: Build 100MB hash table, probe in memory
-- Excellent performance!

-- Scenario B: Large build side
SELECT * FROM medium_table m  -- 10GB build side
JOIN huge_table h ON m.id = h.medium_id;
-- Hash join: Build 10GB hash table...
-- But only 1GB memory available!
-- Spills to disk repeatedly
-- Performance: 10-100× worse than expected

-- Better for Scenario B: Sort-merge join
-- If data is already sorted or indexes exist
-- Avoids memory spills, streams through data

-- Heuristic blindly chose hash join for both,
-- but sort-merge would be faster for Scenario B
```

Heuristics also don't consider whether data is in cache, on SSD, or on spinning disk, even though the cost of a random read differs by orders of magnitude across these tiers.
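A minimal sketch of how a cost-based planner is told about the storage tier, assuming PostgreSQL; the parameter values are illustrative, not recommendations:

```sql
-- Planner cost constants describe the relative price of I/O patterns.
SHOW seq_page_cost;       -- default 1.0
SHOW random_page_cost;    -- default 4.0, tuned for spinning disk

-- On SSDs or well-cached data, random and sequential reads cost about the
-- same, which makes index scans attractive far more often:
SET random_page_cost = 1.1;
SET effective_cache_size = '16GB';  -- how much data the OS cache likely holds
```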
Heuristics rarely incorporate parallelism considerations:
```sql
-- Consider two equivalent plans for aggregation:

-- Plan A: Sort-based aggregation
-- Requires full sort before aggregate
-- Sort cannot parallelize past reduce step
-- Must merge sorted chunks sequentially

-- Plan B: Hash-based aggregation
-- Hash partitions independently
-- Each partition aggregates in parallel
-- Combines at end

-- On single-core: Plan A might be faster (lower overhead)
-- On 64-core: Plan B is likely 10× faster (parallelizable)

-- Heuristic might prefer sort-based (simpler algorithm)
-- without considering available parallelism

-- Similarly for joins:
-- Nested loop: Limited parallelism
-- Hash join: Highly parallelizable
-- Sort-merge: Moderate parallelism

-- The "best" algorithm depends on available cores,
-- something heuristics don't consider
```

Modern cost-based optimizers include resource models: memory budgets for sorts/hashes, I/O cost parameters for different storage types, and parallelism factors. This enables intelligent algorithm selection that heuristics cannot provide.
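As a sketch of that resource awareness, assuming PostgreSQL and the hypothetical medium_table/huge_table join from the memory example above, changing the declared budgets can change which plan the planner picks:

```sql
-- With a small memory budget, a 10GB build side would spill, so the planner
-- may prefer a merge join or a heavily batched hash join:
SET work_mem = '64MB';
EXPLAIN SELECT * FROM medium_table m JOIN huge_table h ON m.id = h.medium_id;

-- With a larger budget and more workers, a parallel hash join becomes viable:
SET work_mem = '2GB';
SET max_parallel_workers_per_gather = 8;
EXPLAIN SELECT * FROM medium_table m JOIN huge_table h ON m.id = h.medium_id;
```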
Certain query patterns are intrinsically difficult for heuristic optimization. These patterns require sophisticated analysis that simple rules cannot provide.
With more than a few tables, join ordering becomes combinatorially complex:
```sql
-- Query joining 8 tables
SELECT *
FROM orders o
JOIN customers c ON o.customer_id = c.id
JOIN products p ON o.product_id = p.id
JOIN categories cat ON p.category_id = cat.id
JOIN suppliers s ON p.supplier_id = s.id
JOIN regions r ON c.region_id = r.id
JOIN warehouses w ON o.warehouse_id = w.id
JOIN shipping sh ON o.shipping_method = sh.id
WHERE cat.name = 'Electronics'
  AND r.name = 'North America';

-- Possible join orderings: (2n-2)! / (n-1)!
-- For n=8: 17,297,280 possible orderings!

-- Heuristic approach: Apply rules greedily
-- "Start with smallest filtered table" → categories (filter applied)
-- Then join products, then... which next?

-- Problem: Greedy choices early on may lock in bad decisions
-- Optimal might require starting with customers (for region filter)
-- then joining orders, then products, then categories

-- Heuristics can't explore the full search space
-- May miss orderings that are 100× faster
```

Heuristics may fail to decorrelate subqueries optimally:
```sql
-- Complex correlated subquery
SELECT o.id, o.total,
  (SELECT AVG(o2.total)
   FROM orders o2
   WHERE o2.customer_id = o.customer_id
     AND o2.order_date < o.order_date) as prior_avg
FROM orders o
WHERE o.total > 1000;

-- Heuristic: Execute subquery for each outer row
-- For 1M orders: 1M subquery executions!

-- Optimal: Decorrelate to a window function.
-- The window must see all of the customer's orders, so it is computed
-- in an inner query and the outer filter is applied afterward
-- (WHERE runs before window functions):
SELECT id, total, prior_avg
FROM (
  SELECT id, total,
    AVG(total) OVER (
      PARTITION BY customer_id
      ORDER BY order_date
      ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING
    ) as prior_avg
  FROM orders
) windowed
WHERE total > 1000;

-- But automatic decorrelation is complex!
-- Requires recognizing the pattern
-- Verifying semantic equivalence
-- Complex subqueries may not decorrelate at all
```

Recursive patterns pose particular challenges for heuristic optimization:
```sql
-- Recursive CTE for tree traversal
WITH RECURSIVE descendants AS (
  -- Base case
  SELECT id, name, parent_id, 0 as depth
  FROM categories
  WHERE id = 1

  UNION ALL

  -- Recursive case
  SELECT c.id, c.name, c.parent_id, d.depth + 1
  FROM categories c
  JOIN descendants d ON c.parent_id = d.id
)
SELECT * FROM descendants WHERE depth <= 3;

-- Challenges for heuristics:
-- 1. Unknown iteration count (depends on tree depth)
-- 2. Unknown intermediate result size (depends on branching)
-- 3. Termination condition affects resource allocation
-- 4. Should results be materialized? Depends on reuse

-- Broader issue: Recursive semantics are fundamentally
-- different from single-pass relational operations
-- Heuristics designed for single-pass don't apply

-- Some optimizations possible:
-- - Push depth filter into recursion (WHERE depth <= 3)
-- - Recognize linear vs. exponential growth patterns
-- But requires specialized analysis
```

Optimizers have a 'horizon': the complexity limit beyond which they give up on exhaustive analysis and fall back to heuristics. Complex queries with many joins, subqueries, or recursive elements often exceed this horizon, receiving suboptimal plans. Breaking queries into simpler parts can help.
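PostgreSQL, for one, makes this horizon explicit and configurable; a brief sketch (parameter names and defaults below are PostgreSQL's, and other systems have analogous limits):

```sql
SHOW join_collapse_limit;  -- default 8: beyond this many joined items, the
                           -- planner keeps the query's written join order
SHOW geqo_threshold;       -- default 12: beyond this, a genetic (randomized)
                           -- search replaces exhaustive join enumeration

-- Raising the limits buys better plans at the cost of planning time:
SET join_collapse_limit = 12;
SET from_collapse_limit = 12;
```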
Some heuristic transformations that appear safe are actually unsafe in edge cases. These semantic subtleties cause rare but serious bugs.
The classic pitfall of SQL semantics:
```sql
-- Heuristic: Convert NOT IN to anti-join
-- Looks equivalent, but...

-- Original query
SELECT * FROM orders
WHERE customer_id NOT IN (
  SELECT id FROM blocked_customers
);

-- Heuristic transformation to anti-join:
SELECT o.* FROM orders o
WHERE NOT EXISTS (
  SELECT 1 FROM blocked_customers b
  WHERE b.id = o.customer_id
);

-- PROBLEM: These are NOT equivalent if blocked_customers.id can be NULL!

-- NOT IN semantics:
-- If blocked_customers contains [1, 2, NULL]:
-- customer_id = 3: NOT IN returns UNKNOWN (due to NULL comparison)
-- UNKNOWN in WHERE = FALSE → row excluded!
-- Result: NO rows returned (or very few)

-- Anti-join semantics:
-- Checks if customer_id matches any blocked id
-- NULL in blocked list doesn't match anything
-- customer_id = 3: No match found → row included
-- Result: Normal filtering behavior

-- The heuristic transformation changes semantics!
-- A safe automatic conversion requires proving that blocked_customers.id
-- cannot produce NULLs (e.g., a NOT NULL constraint or an IS NOT NULL
-- filter in the subquery). A nullable outer column also needs
-- explicit handling:
SELECT * FROM orders o
WHERE NOT EXISTS (
  SELECT 1 FROM blocked_customers b
  WHERE b.id = o.customer_id
)
AND o.customer_id IS NOT NULL;  -- Explicit NULL handling for the outer column
```

Pushing aggregates below outer joins requires careful semantic analysis:
```sql
-- Potentially unsafe aggregate pushdown
-- Original: Aggregate after left join
SELECT c.id, SUM(o.total)
FROM customers c
LEFT JOIN orders o ON c.id = o.customer_id
GROUP BY c.id;
-- Customers with no orders: SUM(o.total) = NULL

-- Heuristic: Push aggregate into subquery
SELECT c.id, order_totals.total_sum
FROM customers c
LEFT JOIN (
  SELECT customer_id, SUM(total) as total_sum
  FROM orders
  GROUP BY customer_id
) order_totals ON c.id = order_totals.customer_id;
-- Customers with no orders: total_sum = NULL (from LEFT JOIN)
-- Same semantics! Transformation is safe here.

-- BUT consider COUNT:
SELECT c.id, COUNT(o.id)
FROM customers c
LEFT JOIN orders o ON c.id = o.customer_id
GROUP BY c.id;
-- Customers with no orders: COUNT(o.id) = 0 (NULL o.id values aren't counted)

-- Naive pushdown:
SELECT c.id, order_counts.cnt
FROM customers c
LEFT JOIN (
  SELECT customer_id, COUNT(id) as cnt
  FROM orders
  GROUP BY customer_id
) order_counts ON c.id = order_counts.customer_id;
-- Customers with no orders: cnt = NULL (no row in subquery!)
-- WRONG semantics! Should be 0, not NULL!

-- Correct transformation requires COALESCE:
SELECT c.id, COALESCE(order_counts.cnt, 0)
FROM customers c
LEFT JOIN (...) order_counts ON ...
```

The order of projection and DISTINCT can affect results:
```sql
-- Table: events (user_id, event_type, event_data)
-- Multiple events per user, event_data may vary

-- Query: Distinct users with their (any) event data
SELECT DISTINCT user_id, event_data
FROM events
WHERE event_type = 'login';
-- Returns: Multiple rows per user (different event_data values)

-- Correct "late projection" interpretation:
-- 1. Filter to 'login' events
-- 2. Apply DISTINCT on (user_id, event_data), i.e. all selected columns
-- Result: Multiple rows per user

-- If projection were pushed too far, e.g. down to user_id alone
-- before DISTINCT:
-- 1. Filter to 'login' events
-- 2. Project to (user_id), then apply DISTINCT
-- Result: One row per user, event_data lost
-- Optimizers don't make this particular mistake,
-- but the same principle matters for GROUP BY

-- Related issue with GROUP BY:
SELECT user_id, event_data  -- Which event_data for each user?
FROM events
GROUP BY user_id;

-- This query is actually invalid in standard SQL!
-- event_data is non-deterministic for each group
-- Some databases allow it (returns arbitrary row)
-- Optimizer can't safely optimize non-deterministic selections
```

Production optimizers undergo extensive testing with edge cases: empty tables, NULL-heavy data, single-row tables, all-identical values, extreme selectivities, and pathological distributions. Despite this, bugs in obscure edge cases are discovered regularly, especially after new heuristics are added.
When heuristics fail, practitioners have several strategies to guide the optimizer toward better plans.
The first defense against poor plans is accurate statistics:
```sql
-- PostgreSQL: Update statistics
ANALYZE table_name;

-- MySQL: Update statistics
ANALYZE TABLE table_name;

-- SQL Server: Update statistics
UPDATE STATISTICS table_name;

-- Oracle: Gather statistics
EXEC DBMS_STATS.GATHER_TABLE_STATS('schema', 'table_name');

-- Extended statistics for correlated columns (PostgreSQL):
CREATE STATISTICS stat_name (dependencies)
  ON department, job_title FROM employees;
ANALYZE employees;

-- Histogram creation for skewed columns:
-- Most databases create automatically during ANALYZE
-- Some allow explicit histogram specification

-- Increase statistics detail on critical columns, then refresh (PostgreSQL):
ALTER TABLE orders ALTER COLUMN product_id SET STATISTICS 1000;
ANALYZE orders;
```

When statistics aren't enough, hints can override optimizer decisions:
```sql
-- PostgreSQL: Disable specific operations
SET enable_seqscan = off;   -- Force index usage
SET enable_hashjoin = off;  -- Force merge or nested loop

-- MySQL: Optimizer hints
SELECT /*+ NO_HASH_JOIN(o) */ *
FROM orders o
JOIN customers c ON o.customer_id = c.id;

SELECT /*+ INDEX(orders idx_orders_date) */ *
FROM orders
WHERE order_date = '2024-01-15';

-- SQL Server: Query hints
SELECT * FROM orders o
JOIN customers c ON o.customer_id = c.id
OPTION (MERGE JOIN);  -- Force merge join

SELECT * FROM orders
WITH (INDEX(idx_orders_date))  -- Force index
WHERE order_date = '2024-01-15';

-- Oracle: Hints
SELECT /*+ LEADING(c o) USE_HASH(o) */ *
FROM customers c
JOIN orders o ON c.id = o.customer_id;

-- PostgreSQL: pg_hint_plan extension
/*+ SeqScan(o) HashJoin(o c) Leading(c o) */
SELECT * FROM orders o JOIN customers c ON ...
```

Sometimes the best solution is rewriting the query to be more optimizer-friendly:
```sql
-- Problem: Optimizer doesn't push predicate into view
-- Original:
SELECT * FROM complex_view WHERE category = 'Electronics';

-- Fix: Inline view and add predicate directly
SELECT ... FROM base_table1 t1
JOIN base_table2 t2 ON ...
WHERE t2.category = 'Electronics';  -- Optimizer can now push

-- Problem: Subquery correlation prevents optimization
-- Original:
SELECT *,
  (SELECT SUM(amount) FROM payments p WHERE p.order_id = o.id)
FROM orders o;

-- Fix: Use LEFT JOIN for better optimization
SELECT o.*, p.total_payments
FROM orders o
LEFT JOIN (
  SELECT order_id, SUM(amount) as total_payments
  FROM payments
  GROUP BY order_id
) p ON p.order_id = o.id;

-- Problem: OR prevents predicate pushdown
-- Original:
SELECT * FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE c.region = 'West' OR o.priority = 'high';

-- Fix: Use UNION to enable pushdown
SELECT * FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE c.region = 'West'
UNION
SELECT * FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE o.priority = 'high';
```

Query hints lock in a plan that may become suboptimal as data grows or distribution changes. Use hints sparingly, document why each hint exists, and periodically review whether hints are still necessary. Prefer fixing underlying issues (statistics, schema design) over permanent hints.
Heuristic optimization provides powerful, reliable improvements for most queries. But understanding its limitations is essential for diagnosing performance problems and knowing when to intervene.
Module Complete:
You've now completed the Heuristic Optimization module. You understand the rule-based optimization paradigm, common heuristic transformations, the deep mechanics of selection and projection pushdown, and the limitations that necessitate cost-based approaches. This foundation prepares you for studying cost-based optimization in the next module.
You now have comprehensive knowledge of heuristic query optimization—its power, its techniques, and its limitations. This understanding is crucial for both working with database systems effectively and appreciating the design of modern hybrid optimizers.