Loading learning content...
Correlated subqueries carry a reputation for poor performance—sometimes deserved, often not. The conceptual model of "execute subquery once per outer row" conjures images of exponential slowdowns. But reality is more nuanced.
Modern database optimizers are remarkably sophisticated at transforming, decorrelating, and optimizing subqueries. A correlated subquery that looks expensive may execute as efficiently as a carefully crafted join. Conversely, a seemingly innocent correlated query may indeed cause performance disasters when:
This page equips you to understand, analyze, and optimize correlated subquery performance—whether writing new queries or diagnosing production issues.
By the end of this page, you will understand when correlated subqueries cause performance problems, how to analyze execution plans, indexing strategies for optimization, techniques for query rewriting, and when to accept the performance characteristics of correlated approaches.
To reason about correlated subquery performance, you need to understand both the theoretical model and how optimizers modify it.
Theoretical Complexity:
For a correlated subquery without optimization:
If N = 100,000 and M = 100,000 (no filtering), you're looking at 10 billion operations—clearly problematic.
What Reduces Actual Cost:
123456789101112131415161718192021222324252627282930313233343536
-- SCENARIO 1: Small, well-indexed-- Outer: 1,000 customers-- Inner: Orders indexed on customer_id-- Each subquery: O(log n) index lookup-- Total: ~1,000 × log(orders) = very fast SELECT c.name FROM customers cWHERE EXISTS ( SELECT 1 FROM orders o WHERE o.customer_id = c.customer_id -- Indexed!); -- SCENARIO 2: Large, no index (DANGER)-- Outer: 100,000 products-- Inner: 1,000,000 reviews, no index on product_id-- Each subquery: Full scan of 1M rows-- Total: 100,000 × 1,000,000 = 100 billion operations SELECT p.name FROM products pWHERE p.price > ( SELECT AVG(r.rating) FROM reviews r -- No index! WHERE r.product_id = p.product_id); -- SCENARIO 3: Decorrelated by optimizer-- Optimizer recognizes pattern and transforms to:SELECT p.name FROM products pJOIN ( SELECT product_id, AVG(rating) as avg_rating FROM reviews GROUP BY product_id) r ON r.product_id = p.product_idWHERE p.price > r.avg_rating;-- Now: Single scan of reviews + hash join = O(n + m)The execution plan reveals how the database actually runs your query. Learning to read plans for correlated subqueries is essential for performance work.
Key Things to Look For:
| Plan Element | Indicates | Performance Implication |
|---|---|---|
| SubPlan / Subquery Scan | Subquery executing as-written | Potential N×M if not optimized |
| Hash Semi Join / Anti Join | EXISTS/NOT EXISTS optimized | Efficient—executed as join |
| Nested Loop + Index Lookup | Per-row with index | Acceptable for small outer sets |
| Hash Join / Merge Join | Decorrelated to join | Efficient—computed once |
| Seq Scan in subquery | No index being used | Red flag for large tables |
| loops=N (high number) | Subquery ran N times | Problem if N is large |
123456789101112131415161718192021222324252627282930313233343536373839404142434445
-- PostgreSQL: Use EXPLAIN ANALYZE for actual execution statsEXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)SELECT e.name, e.salaryFROM employees eWHERE e.salary > ( SELECT AVG(e2.salary) FROM employees e2 WHERE e2.dept_id = e.dept_id); -- Sample output analysis:/*Seq Scan on employees e (cost=0.00..2501.00 rows=333 loops=1) Filter: (salary > (SubPlan 1)) Rows Removed by Filter: 667 SubPlan 1 -> Aggregate (cost=24.50..24.51 rows=1 loops=1000) ^^^^^^^^^ RED FLAG: 1000 loops! -> Seq Scan on employees e2 Filter: (dept_id = e.dept_id) ⚠️ "loops=1000" means subquery executed 1000 times⚠️ "Seq Scan on employees e2" means full table scan each time*/ -- Compare with decorrelated version:EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)SELECT e.name, e.salaryFROM employees eJOIN ( SELECT dept_id, AVG(salary) as avg_sal FROM employees GROUP BY dept_id) d ON d.dept_id = e.dept_idWHERE e.salary > d.avg_sal; /*Hash Join (cost=...) (loops=1) ^^^^^^^ Single execution! -> Seq Scan on employees -> Hash (Subquery Scan on employees) -> HashAggregate (GROUP BY dept_id)*/In PostgreSQL EXPLAIN output, watch the 'loops' count. loops=1 for operations inside a subquery means decorrelation succeeded. loops=10000 means I trouble—the subquery ran 10,000 times. MySQL and SQL Server have equivalent indicators in their plan formats.
Proper indexing can transform a catastrophically slow correlated subquery into an efficient query. The key is indexing the correlation columns—the columns that link inner to outer query.
1234567891011121314151617181920212223242526272829303132333435363738
-- THE QUERY:SELECT c.customer_id, c.nameFROM customers cWHERE EXISTS ( SELECT 1 FROM orders o WHERE o.customer_id = c.customer_id -- Correlation column AND o.order_date >= '2024-01-01' -- Additional filter); -- REQUIRED INDEX:-- Index on the correlation column(s) in the INNER tableCREATE INDEX idx_orders_customer_id ON orders(customer_id); -- BETTER INDEX:-- Covering index includes the filter columnCREATE INDEX idx_orders_customer_date ON orders(customer_id, order_date); -- With this index:-- For each customer, the EXISTS check uses index seek → O(log n)-- Without index: Full table scan → O(n) -- AGGREGATE CORRELATION:SELECT e.nameFROM employees eWHERE e.salary > ( SELECT AVG(e2.salary) FROM employees e2 WHERE e2.dept_id = e.dept_id -- Correlation column); -- INDEX FOR AGGREGATE:CREATE INDEX idx_employees_dept_id ON employees(dept_id); -- Even better (covering index with salary for AVG):CREATE INDEX idx_employees_dept_salary ON employees(dept_id, salary); -- This lets the subquery compute AVG from index alone (index-only scan)A common mistake is indexing the correlation column on the outer table instead of the inner table. For 'WHERE inner.fk = outer.pk', you need an index on inner.fk, not outer.pk. The inner table is what gets searched repeatedly.
When the optimizer doesn't decorrelate automatically, or when you need explicit control, manual query rewriting provides alternatives. These transformations preserve semantics while changing execution characteristics.
Manual Decorrelation to JOIN:
Compute the correlated values once, then join.
12345678910111213141516171819202122232425
-- ORIGINAL (correlated):SELECT p.product_id, p.name, p.priceFROM products pWHERE p.price > ( SELECT AVG(p2.price) FROM products p2 WHERE p2.category_id = p.category_id); -- REWRITTEN (decorrelated join):WITH category_avgs AS ( SELECT category_id, AVG(price) as avg_price FROM products GROUP BY category_id)SELECT p.product_id, p.name, p.priceFROM products pJOIN category_avgs ca ON ca.category_id = p.category_idWHERE p.price > ca.avg_price; -- Benefits:-- • AVG computed once per category-- • Single pass through products-- • Clear, readable structure-- • Often faster execution planCertain patterns consistently cause performance problems with correlated subqueries. Recognizing and avoiding these anti-patterns prevents common performance disasters.
1234567891011121314151617181920212223242526272829303132333435363738394041424344454647484950515253545556575859606162636465666768697071
-- ANTI-PATTERN 1: Multiple correlated subqueries computing similar things-- ❌ BAD: Three separate subqueries, three passesSELECT c.name, (SELECT COUNT(*) FROM orders o WHERE o.customer_id = c.customer_id), (SELECT SUM(amount) FROM orders o WHERE o.customer_id = c.customer_id), (SELECT AVG(amount) FROM orders o WHERE o.customer_id = c.customer_id)FROM customers c; -- ✓ GOOD: Single aggregationSELECT c.name, stats.cnt, stats.total, stats.avg_amountFROM customers cLEFT JOIN LATERAL ( SELECT COUNT(*) cnt, SUM(amount) total, AVG(amount) avg_amount FROM orders o WHERE o.customer_id = c.customer_id) stats ON true; -- ANTI-PATTERN 2: Correlated subquery on unindexed column-- ❌ BAD: Full scan for every outer rowSELECT p.product_nameFROM products pWHERE p.price > ( SELECT AVG(r.sentiment_score) -- No index on product_id! FROM reviews r WHERE r.product_id = p.product_id); -- ✓ FIX: Add indexCREATE INDEX idx_reviews_product ON reviews(product_id);-- Or rewrite to JOIN with pre-computed aggregates -- ANTI-PATTERN 3: Correlated subquery with JOIN inside-- ❌ BAD: Complex join executed per-rowSELECT c.nameFROM customers cWHERE ( SELECT SUM(oi.quantity * oi.price) FROM orders o JOIN order_items oi ON oi.order_id = o.order_id JOIN products p ON p.product_id = oi.product_id WHERE o.customer_id = c.customer_id AND p.category = 'Electronics') > 1000; -- ✓ GOOD: Pre-compute the aggregationWITH customer_electronics_spend AS ( SELECT o.customer_id, SUM(oi.quantity * oi.price) as total FROM orders o JOIN order_items oi ON oi.order_id = o.order_id JOIN products p ON p.product_id = oi.product_id WHERE p.category = 'Electronics' GROUP BY o.customer_id)SELECT c.nameFROM customers cJOIN customer_electronics_spend ces ON ces.customer_id = c.customer_idWHERE ces.total > 1000; -- ANTI-PATTERN 4: UsingFunction in correlation predicate-- ❌ BAD: Function prevents index usageSELECT c.name FROM customers cWHERE EXISTS ( SELECT 1 FROM orders o WHERE UPPER(o.customer_code) = UPPER(c.code) -- No index use!); -- ✓ GOOD: Use consistent case in data, or functional indexCREATE INDEX idx_orders_customer_code_upper ON orders(UPPER(customer_code));The worst anti-pattern: writing correlated subqueries with complex JOINs on large, unindexed tables without checking execution plans. Always EXPLAIN ANALYZE before deploying correlated subqueries on production-scale data.
Not all correlated subqueries need optimization. In many cases, they perform excellently without intervention. Understanding when to accept correlated performance prevents premature optimization.
123456789101112131415161718192021222324252627282930313233
-- ACCEPTABLE: EXISTS with good index, common matchesSELECT c.customer_id, c.nameFROM customers cWHERE c.signup_date >= '2024-01-01' -- Selective outer filterAND EXISTS ( SELECT 1 FROM orders o WHERE o.customer_id = c.customer_id -- Indexed);-- New customers (small set) + indexed EXISTS = fast -- ACCEPTABLE: Scalar subquery with cached resultsSELECT e.name, e.salary, (SELECT d.dept_name FROM departments d WHERE d.dept_id = e.dept_id) as departmentFROM employees e;-- 1000 employees, 10 departments = 10 unique lookups (cached) -- ACCEPTABLE: Small outer set, any subquery complexitySELECT c.name, (SELECT SUM(total) FROM orders WHERE customer_id = c.customer_id)FROM customers cWHERE c.customer_id IN (101, 102, 103); -- Only 3 customers!-- 3 subquery executions, complexity doesn't matter -- CHECK BEFORE OPTIMIZING:EXPLAIN ANALYZESELECT ... -- your correlated query -- If total execution time is acceptable (e.g., < 100ms for online,-- < 1 minute for batch), optimization effort may not be worthwhilePerformance of correlated subqueries can change over time as data grows or distributions shift. Continuous monitoring ensures queries remain performant.
1234567891011121314151617181920212223242526272829303132
-- PostgreSQL: Find slow queries with subplansSELECT query, calls, mean_time, total_timeFROM pg_stat_statementsWHERE query ILIKE '%EXISTS%' OR query ILIKE '%SELECT%SELECT%'ORDER BY mean_time DESCLIMIT 20; -- MySQL: Performance Schema for subquery analysisSELECT DIGEST_TEXT, COUNT_STAR, AVG_TIMER_WAIT/1000000000 as avg_msFROM performance_schema.events_statements_summary_by_digestWHERE DIGEST_TEXT LIKE '%EXISTS%'ORDER BY AVG_TIMER_WAIT DESCLIMIT 20; -- Create a baseline for critical queries-- Store execution plans and times, compare periodicallyCREATE TABLE query_performance_log ( query_id VARCHAR(100), captured_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP, execution_time_ms NUMERIC, plan_hash VARCHAR(64), row_estimate INTEGER, actual_rows INTEGER); -- Log performance periodicallyEXPLAIN (ANALYZE, FORMAT JSON)SELECT ...; -- Critical correlated query -- Compare plan_hash over time to detect plan regressionsCorrelated subquery performance often degrades nonlinearly with data growth. A query that takes 100ms at 100K rows might take 10 seconds at 1M rows and 10 minutes at 10M rows. Plan for growth.
We've comprehensively covered performance aspects of correlated subqueries. Here are the essential takeaways:
Module Complete! You've now mastered correlated subqueries—from the fundamental concept of correlation, through EXISTS and NOT EXISTS operators, the comparison with non-correlated alternatives, and finally performance optimization. These skills enable you to write sophisticated, efficient SQL queries that leverage one of the most powerful features of the SQL language.
You now have comprehensive knowledge of correlated subquery performance—how to analyze it, optimize it, and know when optimization is unnecessary. Combined with the previous pages, you have complete command over correlated subqueries in SQL.