Subqueries are one of SQL's most powerful features—they let you nest one query inside another, composing complex logic from simple building blocks. But this composability comes with a hidden danger: subqueries can be the source of some of the most severe performance problems in database systems.
A subquery that works instantly for 100 rows might grind to a halt at 100,000 rows. The difference is rarely about the subquery itself—it's about how the database executes it, and whether that execution strategy scales with your data. Understanding subquery optimization transforms mysterious slow queries into predictable, efficient operations.
By the end of this page, you'll understand the critical difference between correlated and non-correlated subqueries, master strategies for transforming expensive subqueries, learn when subqueries outperform joins, and develop intuition for recognizing subquery performance traps.
The way a subquery is executed depends fundamentally on whether it references values from the outer query. This distinction—correlated vs. non-correlated—determines whether the subquery runs once or potentially millions of times.
Non-Correlated Subqueries (Run Once):
A non-correlated subquery is independent of the outer query. It produces a fixed result that the outer query uses like a constant or a lookup table.
```sql
-- Non-correlated scalar subquery
SELECT employee_name, salary
FROM employees
WHERE salary > (SELECT AVG(salary) FROM employees);

-- Execution flow:
-- 1. Execute subquery ONCE: AVG(salary) = $75,000
-- 2. Substitute result: WHERE salary > 75000
-- 3. Execute outer query with constant filter
-- Total subquery executions: 1

-- Non-correlated IN subquery
SELECT product_name
FROM products
WHERE category_id IN (
    SELECT id FROM categories WHERE name LIKE 'Electronics%'
);

-- Execution flow:
-- 1. Execute subquery ONCE: Returns [5, 12, 23]
-- 2. Substitute: WHERE category_id IN (5, 12, 23)
-- 3. Execute as index lookup or hash semi-join
-- Total subquery executions: 1
```

Correlated Subqueries (Run Per Row):
A correlated subquery references columns from the outer query. It must be re-evaluated for each row the outer query considers.
```sql
-- Correlated scalar subquery
SELECT e1.employee_name, e1.salary, e1.department_id
FROM employees e1
WHERE e1.salary > (
    SELECT AVG(e2.salary)
    FROM employees e2
    WHERE e2.department_id = e1.department_id  -- Correlation!
);

-- Execution flow (naive):
-- For EACH employee in e1:
--   1. Execute subquery with e1.department_id = 'Sales' → $80K
--   2. Compare e1.salary > $80K
--   3. Move to next employee
--   4. Execute subquery with e1.department_id = 'Engineering' → $95K
--   5. Compare...
-- Total subquery executions: Same as number of employees!

-- With 100,000 employees: 100,000 subquery executions
```

| Aspect | Non-Correlated | Correlated (Naive) |
|---|---|---|
| Subquery executions | 1 | N (rows in outer query) |
| With 100K outer rows | 1 execution | 100,000 executions |
| With 1M outer rows | 1 execution | 1,000,000 executions |
| Time complexity | O(subquery) + O(outer) | O(outer × subquery) |
| Impact of slow subquery | Constant overhead | Multiplied by outer rows |
Correlated subqueries can be orders of magnitude slower than they appear. A subquery that executes in 10ms becomes a roughly 28-hour query when correlated across 10 million rows (10,000,000 × 10ms ≈ 100,000 seconds). Always examine whether your subquery correlates to the outer query.
Modern query optimizers apply sophisticated transformations to improve subquery execution. Understanding these transformations helps you write subqueries the optimizer can improve—or rewrite them yourself when it can't.
Subquery Flattening (Unnesting):
The optimizer converts subqueries into joins when possible, eliminating repeated execution:
```sql
-- Original: Correlated subquery
SELECT e.name
FROM employees e
WHERE e.salary > (
    SELECT AVG(e2.salary)
    FROM employees e2
    WHERE e2.department_id = e.department_id
);

-- Optimizer transforms to (conceptually):
SELECT e.name
FROM employees e
JOIN (
    SELECT department_id, AVG(salary) as avg_salary
    FROM employees
    GROUP BY department_id
) dept_avgs ON dept_avgs.department_id = e.department_id
WHERE e.salary > dept_avgs.avg_salary;

-- The subquery now executes ONCE (grouped), then joins
-- Instead of N subqueries, now: 1 aggregate + 1 join
```

Semi-Join Conversion:
IN and EXISTS subqueries are often converted to semi-joins:
```sql
-- Original: IN subquery
SELECT c.name
FROM customers c
WHERE c.id IN (SELECT customer_id FROM orders WHERE total > 1000);

-- Optimizer may convert to semi-join:
SELECT c.name
FROM customers c
SEMI JOIN orders o ON o.customer_id = c.id AND o.total > 1000;
-- (SEMI JOIN is internal - returns c rows for which a match exists)

-- Semi-join advantage:
-- - Stops at first match (doesn't return duplicates)
-- - Can use hash or nested loop strategies
-- - More optimization options than correlated subquery
```

Anti-Join Conversion:
NOT IN and NOT EXISTS convert to anti-joins:
```sql
-- Original: NOT EXISTS subquery
SELECT c.name
FROM customers c
WHERE NOT EXISTS (
    SELECT 1 FROM orders o WHERE o.customer_id = c.id
);

-- Optimizer converts to anti-join:
SELECT c.name
FROM customers c
ANTI JOIN orders o ON o.customer_id = c.id;
-- (ANTI JOIN returns c rows with NO match in orders)

-- Anti-join strategies:
-- - Hash anti-join: Build hash of orders, probe with customers
-- - Merge anti-join: Sorted merge, return non-matches
-- - Nested loop anti-join: For each customer, seek in orders index
```

Use EXPLAIN to see if the optimizer transformed your subquery. Look for 'Semi Join', 'Anti Join', or the subquery appearing as a derived table join. If you see 'SubPlan' or 'InitPlan' with loops = N, the subquery is executing repeatedly.
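As a quick illustration of this tip, here is a minimal sketch in PostgreSQL syntax; the plan fragment in the comments shows the shape to look for, not exact output:

```sql
-- Check whether the NOT EXISTS above was unnested (PostgreSQL)
EXPLAIN
SELECT c.name
FROM customers c
WHERE NOT EXISTS (
    SELECT 1 FROM orders o WHERE o.customer_id = c.id
);

-- A plan shaped like this means the subquery became an anti-join:
--   Hash Anti Join
--     Hash Cond: (c.id = o.customer_id)
--     ->  Seq Scan on customers c
--     ->  Hash
--           ->  Seq Scan on orders o
--
-- A SubPlan node with loops > 1 would instead mean per-row re-execution.
```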
Two common patterns for testing set membership—EXISTS and IN—have nuanced performance characteristics. Understanding when each excels helps you choose the optimal pattern.
EXISTS Behavior:
EXISTS returns true/false based on whether the subquery returns any rows. It's inherently efficient for existence checks:
```sql
-- EXISTS: Stops at first match
SELECT c.name
FROM customers c
WHERE EXISTS (
    SELECT 1 FROM orders o
    WHERE o.customer_id = c.id
      AND o.status = 'delivered'
);

-- For a customer with 1000 orders:
-- EXISTS: Checks orders until first 'delivered' found, stops
-- If first order matches, checks only 1 order

-- Index optimization:
-- With index on orders(customer_id, status):
-- EXISTS performs a single index seek per customer
```

IN Behavior:
IN compares against a set of values. Performance depends on subquery result size and optimization strategy:
```sql
-- IN: Works with the result set
SELECT c.name
FROM customers c
WHERE c.id IN (
    SELECT customer_id FROM orders WHERE status = 'delivered'
);

-- Execution strategies:
-- 1. Materialize subquery, then hash semi-join
--    - Subquery runs once, builds hash table
--    - Probe with each customer
--    - Good for large subquery results

-- 2. Convert to correlated EXISTS (some optimizers)
--    - For each customer, check if in orders
--    - Good for small outer table

-- 3. Nested loop semi-join with index
--    - For each customer, index seek in subquery result
--    - Good when outer table is small
```

When to Use Each:

| Situation | Prefer |
|---|---|
| Correlated existence check, especially with a supporting index | EXISTS |
| Small or moderate non-correlated value set | IN |
| Anti-join where the subquery column may contain NULLs | NOT EXISTS (see below) |
Critical: IN with NULLs:
```sql
-- The NULL trap with NOT IN
SELECT c.name FROM customers c
WHERE c.id NOT IN (SELECT customer_id FROM orders);

-- If ANY customer_id in orders is NULL:
-- NOT IN evaluates as: id <> 1 AND id <> 2 AND id <> NULL ...
-- Anything compared to NULL is UNKNOWN
-- UNKNOWN AND TRUE = UNKNOWN
-- WHERE UNKNOWN = No rows returned!

-- Solution 1: Filter NULLs
SELECT c.name FROM customers c
WHERE c.id NOT IN (
    SELECT customer_id FROM orders WHERE customer_id IS NOT NULL
);

-- Solution 2: Use NOT EXISTS (NULL-safe)
SELECT c.name FROM customers c
WHERE NOT EXISTS (
    SELECT 1 FROM orders o WHERE o.customer_id = c.id
);
-- NULL in orders.customer_id never equals c.id, so it works correctly
```

NOT IN (subquery) returns NO ROWS if the subquery contains ANY NULL value. This is a frequent source of bugs. Use NOT EXISTS for NULL-safe anti-join semantics, or explicitly exclude NULLs from the subquery.
Scalar subqueries return a single value and appear in SELECT lists, WHERE clauses, or expressions. Their optimization is critical because correlated scalar subqueries can execute once per output row.
Scalar Subquery in SELECT:
```sql
-- Problem: Correlated scalar subqueries in SELECT
SELECT c.name, c.id,
    (SELECT COUNT(*) FROM orders o WHERE o.customer_id = c.id) as order_count,
    (SELECT MAX(o.total) FROM orders o WHERE o.customer_id = c.id) as max_order,
    (SELECT SUM(o.total) FROM orders o WHERE o.customer_id = c.id) as total_spent
FROM customers c;

-- For 100,000 customers: 300,000 subquery executions!
-- Each subquery scans orders table (or uses index 100K times)

-- Solution: Aggregate once, join
SELECT c.name, c.id,
    COALESCE(stats.order_count, 0) as order_count,
    stats.max_order,
    COALESCE(stats.total_spent, 0) as total_spent
FROM customers c
LEFT JOIN (
    SELECT customer_id,
        COUNT(*) as order_count,
        MAX(total) as max_order,
        SUM(total) as total_spent
    FROM orders
    GROUP BY customer_id
) stats ON stats.customer_id = c.id;

-- One aggregation pass over orders, then one hash join
-- Complexity: O(orders) + O(customers) instead of O(customers × orders)
```

Scalar Subquery in WHERE:
```sql
-- Non-correlated: Runs once (efficient)
SELECT name FROM products
WHERE price > (SELECT AVG(price) FROM products);
-- Subquery result is constant, optimizer executes once

-- Correlated: Runs per row (expensive)
SELECT e.name FROM employees e
WHERE e.salary > (
    SELECT AVG(salary) FROM employees e2
    WHERE e2.department_id = e.department_id
);

-- Transformation to join:
SELECT e.name
FROM employees e
JOIN (
    SELECT department_id, AVG(salary) as avg_sal
    FROM employees
    GROUP BY department_id
) dept_avg ON dept_avg.department_id = e.department_id
WHERE e.salary > dept_avg.avg_sal;
```

Multiple Scalar Subqueries Consolidation:
When multiple scalar subqueries reference the same table, consolidate them:
```sql
-- Before: 4 separate subqueries
SELECT
    product_id,
    (SELECT name FROM products p WHERE p.id = oi.product_id) as name,
    (SELECT price FROM products p WHERE p.id = oi.product_id) as price,
    (SELECT category FROM products p WHERE p.id = oi.product_id) as category,
    (SELECT manufacturer FROM products p WHERE p.id = oi.product_id) as mfr
FROM order_items oi;

-- After: Single join
SELECT
    oi.product_id,
    p.name,
    p.price,
    p.category,
    p.manufacturer
FROM order_items oi
JOIN products p ON p.id = oi.product_id;

-- Even better with CTE for clarity:
WITH product_details AS (
    SELECT id, name, price, category, manufacturer
    FROM products
)
SELECT oi.product_id, pd.name, pd.price, pd.category, pd.manufacturer
FROM order_items oi
JOIN product_details pd ON pd.id = oi.product_id;
```

Some optimizers cache scalar subquery results for repeated correlation values. PostgreSQL's 'SubPlan' shows 'Loops: N' but may cache results for repeated correlation values (such as the same department_id). However, don't rely on this—explicit transformation is more reliable.
Derived tables (subqueries in FROM clause) create temporary result sets. Their optimization depends on when they're materialized versus inlined.
Materialization vs. Merge:
```sql
-- Simple derived table (often merged/inlined)
SELECT *
FROM (SELECT id, name FROM customers WHERE status = 'active') c
WHERE c.name LIKE 'A%';

-- Optimizer typically merges into:
SELECT id, name FROM customers
WHERE status = 'active' AND name LIKE 'A%';
-- Both filters applied together, single index scan if available

-- Complex derived table (typically materialized)
SELECT dt.category, dt.avg_price
FROM (
    SELECT category, AVG(price) as avg_price
    FROM products
    GROUP BY category
    HAVING AVG(price) > 100
) dt
WHERE dt.category IN ('Electronics', 'Books');

-- Structure usually forces materialization:
-- 1. Execute the aggregation (HAVING depends on the full group averages)
-- 2. Materialize the grouped result with categories and averages
-- 3. Then filter for specific categories
-- (Some optimizers can still push the category filter below the GROUP BY,
--  since it references only a grouping column)
```

Optimization Blockers:
Certain patterns prevent derived table optimization:
| Pattern | Why It Blocks Optimization | Alternative |
|---|---|---|
| DISTINCT in derived table | Must deduplicate before join | Move DISTINCT to final query if possible |
| GROUP BY in derived table | Must aggregate before join | Consider window functions or join first |
| TOP/LIMIT in derived table | Must materialize partial result | Apply LIMIT to final query when possible |
| UNION in derived table | Must combine sets before join | Restructure as separate queries with UNION ALL |
| Aggregate without correlation | Compute whole aggregate first | Use correlated aggregate or window function |
Strategic Derived Table Use:
Derived tables can help optimization when used strategically:
```sql
-- Good use: Pre-filter before expensive join
SELECT c.name, recent_orders.*
FROM customers c
JOIN (
    SELECT * FROM orders
    WHERE order_date > CURRENT_DATE - 30
) recent_orders ON recent_orders.customer_id = c.id;
-- Encourages the optimizer to reduce orders to the last 30 days before the join

-- Good use: Reduce join cardinality
SELECT c.name, top_order.total
FROM customers c
JOIN (
    SELECT DISTINCT ON (customer_id) customer_id, total
    FROM orders
    ORDER BY customer_id, total DESC
) top_order ON top_order.customer_id = c.id;
-- Each customer joins to exactly one row

-- Good use: Encapsulate complex logic
SELECT *
FROM (
    SELECT
        customer_id,
        SUM(total) as lifetime_value,
        NTILE(4) OVER (ORDER BY SUM(total) DESC) as quartile
    FROM orders
    GROUP BY customer_id
) customer_tiers
WHERE quartile = 1;  -- Top 25% of customers
```

CTEs (WITH clause) and derived tables often produce identical plans. However, some databases (older PostgreSQL, SQLite) materialize CTEs by default, while derived tables are more likely to be inlined. Check EXPLAIN for your specific database version.
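On PostgreSQL 12 and later you can steer this decision explicitly; a brief sketch reusing the recent-orders example above (the MATERIALIZED / NOT MATERIALIZED keywords are PostgreSQL-specific):

```sql
-- NOT MATERIALIZED lets the CTE be inlined like a derived table;
-- MATERIALIZED forces a one-time computation of the CTE result.
WITH recent_orders AS NOT MATERIALIZED (
    SELECT customer_id, total
    FROM orders
    WHERE order_date > CURRENT_DATE - 30
)
SELECT c.name, r.total
FROM customers c
JOIN recent_orders r ON r.customer_id = c.id;
-- Older PostgreSQL versions always materialize CTEs, so confirm with EXPLAIN.
```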
When the optimizer can't transform a subquery efficiently, manual transformation to joins often provides dramatic speedups.
Pattern 1: Correlated Aggregate to JOIN:
```sql
-- Before: Correlated subquery (per-row execution)
SELECT p.name, p.price,
    p.price - (SELECT AVG(p2.price)
               FROM products p2
               WHERE p2.category_id = p.category_id) as diff_from_avg
FROM products p;

-- After: Pre-aggregate and join
SELECT p.name, p.price,
    p.price - cat_avg.avg_price as diff_from_avg
FROM products p
JOIN (
    SELECT category_id, AVG(price) as avg_price
    FROM products
    GROUP BY category_id
) cat_avg ON cat_avg.category_id = p.category_id;

-- Or using window function (often most elegant):
SELECT name, price,
    price - AVG(price) OVER (PARTITION BY category_id) as diff_from_avg
FROM products;
```

Pattern 2: EXISTS to JOIN:
```sql
-- Before: EXISTS subquery
SELECT DISTINCT c.id, c.name
FROM customers c
WHERE EXISTS (
    SELECT 1 FROM orders o
    WHERE o.customer_id = c.id AND o.total > 1000
);

-- After: JOIN with DISTINCT
SELECT DISTINCT c.id, c.name
FROM customers c
JOIN orders o ON o.customer_id = c.id
WHERE o.total > 1000;

-- Or semi-join syntax where supported (cleaner semantics):
SELECT c.id, c.name
FROM customers c
WHERE c.id IN (SELECT customer_id FROM orders WHERE total > 1000);
-- Modern optimizers convert this to semi-join automatically
```

Pattern 3: NOT EXISTS to LEFT JOIN + IS NULL:
```sql
-- Before: NOT EXISTS subquery
SELECT c.id, c.name
FROM customers c
WHERE NOT EXISTS (
    SELECT 1 FROM orders o WHERE o.customer_id = c.id
);

-- After: LEFT JOIN + IS NULL check
SELECT c.id, c.name
FROM customers c
LEFT JOIN orders o ON o.customer_id = c.id
WHERE o.customer_id IS NULL;

-- Which is faster depends on:
-- - Optimizer capabilities (modern ones: similar performance)
-- - Available indexes
-- - Table sizes and selectivity

-- Test both with EXPLAIN ANALYZE in your environment
```

Pattern 4: Correlated TOP-N to Window Function:
```sql
-- Before: Correlated subquery for "latest order per customer"
SELECT c.name, o.order_id, o.total
FROM customers c
JOIN orders o ON o.customer_id = c.id
WHERE o.order_date = (
    SELECT MAX(order_date)
    FROM orders o2
    WHERE o2.customer_id = c.id
);
-- Subquery runs for each order row!

-- After: Window function (single pass)
SELECT c.name, o.order_id, o.total
FROM customers c
JOIN (
    SELECT order_id, customer_id, total, order_date,
        ROW_NUMBER() OVER (
            PARTITION BY customer_id
            ORDER BY order_date DESC
        ) as rn
    FROM orders
) o ON o.customer_id = c.id AND o.rn = 1;

-- Even cleaner with CTE:
WITH latest_orders AS (
    SELECT *,
        ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date DESC) as rn
    FROM orders
)
SELECT c.name, lo.order_id, lo.total
FROM customers c
JOIN latest_orders lo ON lo.customer_id = c.id AND lo.rn = 1;
```

Always verify transformations produce identical results with test data. Subtle semantic differences (NULLs, duplicates, edge cases) can cause transformations to produce different results. Compare COUNT(*) and spot-check sample rows.
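As one way to do that, here is a sketch that diffs the two forms from Pattern 3 with EXCEPT (run it again with the operands swapped to check both directions):

```sql
-- Rows returned by the NOT EXISTS form but not by the LEFT JOIN form
-- (an empty result in both directions suggests the rewrite is equivalent)
SELECT c.id, c.name
FROM customers c
WHERE NOT EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.id)
EXCEPT
SELECT c.id, c.name
FROM customers c
LEFT JOIN orders o ON o.customer_id = c.id
WHERE o.customer_id IS NULL;
```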
Despite general guidance to prefer joins, subqueries sometimes outperform their join equivalents. Understanding these cases prevents unnecessary refactoring.
Case 1: Early Filtering with EXISTS:
```sql
-- Scenario: Find customers who ordered product X
-- Customer 12345 has 50,000 orders, product X is in order #5

-- JOIN approach: Scans all matching orders
SELECT DISTINCT c.name
FROM customers c
JOIN orders o ON o.customer_id = c.id
JOIN order_items oi ON oi.order_id = o.order_id
WHERE oi.product_id = 'X' AND c.id = 12345;
-- Joins all 50,000 orders, then filters to product X

-- EXISTS approach: Stops at first match
SELECT c.name
FROM customers c
WHERE c.id = 12345
  AND EXISTS (
    SELECT 1
    FROM orders o
    JOIN order_items oi ON oi.order_id = o.order_id
    WHERE o.customer_id = c.id AND oi.product_id = 'X'
);
-- Finds product X in order #5, immediately returns
```

Case 2: Avoiding Cardinality Explosion:
```sql
-- Scenario: Count customers who have any order with items

-- JOIN approach: Cardinality explosion
SELECT COUNT(DISTINCT c.id)
FROM customers c
JOIN orders o ON o.customer_id = c.id
JOIN order_items oi ON oi.order_id = o.order_id;
-- If average customer has 100 orders with 10 items each:
-- 10,000 customers × 100 orders × 10 items = 10 million intermediate rows
-- Then COUNT DISTINCT de-duplicates expensively

-- EXISTS approach: No explosion
SELECT COUNT(*)
FROM customers c
WHERE EXISTS (
    SELECT 1
    FROM orders o
    JOIN order_items oi ON oi.order_id = o.order_id
    WHERE o.customer_id = c.id
);
-- 10,000 customers × 1 existence check each = 10,000 checks
-- Each check stops at first match
```

Case 3: Scalar Subquery for Single Value:
```sql
-- Scenario: Show each product with the global average price

-- JOIN approach: Compute Cartesian product
SELECT p.name, p.price, avg_data.global_avg
FROM products p
CROSS JOIN (SELECT AVG(price) as global_avg FROM products) avg_data;
-- Semantic: Cross join suggests Cartesian, even though result is 1 row

-- Scalar subquery: Cleaner, equally efficient
SELECT p.name, p.price,
    (SELECT AVG(price) FROM products) as global_avg
FROM products p;
-- The subquery executes once, result used as constant
-- Cleaner intent: "compute this one value"

-- Both produce identical plans in most optimizers
```

Modern optimizers often transform subqueries and joins into equivalent plans. The 'subquery is slow' advice originated when optimizers were less sophisticated. Always measure with EXPLAIN ANALYZE rather than assuming one form is faster.
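To follow that advice for the example above, a minimal sketch that measures both forms side by side (timings and plan shapes will vary by database and data size):

```sql
-- Compare the 'Execution Time' reported at the bottom of each plan
EXPLAIN ANALYZE
SELECT p.name, p.price,
    (SELECT AVG(price) FROM products) as global_avg
FROM products p;

EXPLAIN ANALYZE
SELECT p.name, p.price, avg_data.global_avg
FROM products p
CROSS JOIN (SELECT AVG(price) as global_avg FROM products) avg_data;
-- If the plans match (e.g. the average computed once as an InitPlan or
-- a one-row join input), the choice is purely about readability.
```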
When subqueries cause performance problems, systematic diagnosis reveals the root cause.
Step 1: Check Execution Count
```sql
-- PostgreSQL EXPLAIN ANALYZE reveals loop count
EXPLAIN ANALYZE
SELECT e.name
FROM employees e
WHERE e.salary > (
    SELECT AVG(salary) FROM employees e2
    WHERE e2.department_id = e.department_id
);

-- Look for "SubPlan" or "InitPlan" sections:
--   SubPlan 1 (returns $2)
--     Aggregate (actual time=0.1..0.1 rows=1 loops=10000)  <-- 10000 executions!
--       Seq Scan on employees e2

-- For non-correlated subquery, you'd see loops=1

-- MySQL EXPLAIN shows SELECT type:
-- DEPENDENT SUBQUERY = correlated (bad if outer table large)
-- SUBQUERY = non-correlated (good, runs once)
```

Step 2: Identify the Correlation:
```sql
-- The correlation is the outer reference in the subquery
SELECT o.order_id
FROM orders o
WHERE o.total > (
    SELECT AVG(o2.total)
    FROM orders o2
    WHERE o2.customer_id = o.customer_id  -- This is the correlation
      AND o2.order_date > o.order_date    -- Another correlation
);

-- Each outer reference creates a dependency:
-- The subquery depends on o.customer_id and o.order_date from the outer query
-- These values change for each row, forcing re-execution

-- Question to ask: "Can I pre-compute this once for all values?"
-- If yes → transform to derived table/CTE with JOIN
-- If no (depends on row-specific context) → ensure efficient index
```

Step 3: Measure Subquery in Isolation:
```sql
-- Extract and test the subquery with sample correlation values
-- Original query has: WHERE e2.department_id = e.department_id

-- Test with actual department values:
EXPLAIN ANALYZE
SELECT AVG(salary) FROM employees WHERE department_id = 5;
-- Time: 0.5 ms - fast individually

EXPLAIN ANALYZE
SELECT AVG(salary) FROM employees WHERE department_id = 12;
-- Time: 0.5 ms - consistent

-- If 10,000 employees: 10,000 × 0.5ms = 5 seconds
-- Solution: Pre-aggregate all departments once

SELECT department_id, AVG(salary) as avg_sal
FROM employees
GROUP BY department_id;
-- Time: 50 ms for ALL departments at once
-- Then join: O(n) instead of O(n²)
```

| Check | What to Look For | Action if Problematic |
|---|---|---|
| Execution count (loops) | loops > 1 for same SubPlan | Transform to derived table + JOIN |
| Correlation columns | Outer references in subquery | Pre-compute or ensure index exists |
| Missing index | Seq Scan in subquery on large table | Add index matching WHERE clause |
| Large materialization | 'Materialize' with many rows | Add filtering conditions inside subquery |
| Sort in subquery | Sort operation for MAX/MIN/ORDER BY | Add ordered index to avoid sort |
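For the 'Missing index' row above, a minimal sketch of the fix using the employees example from Step 1 (the index name is illustrative):

```sql
-- A composite index matching the correlated WHERE clause lets each loop
-- become an index (often index-only) scan instead of a sequential scan.
CREATE INDEX idx_employees_dept_salary
    ON employees (department_id, salary);

-- Re-run the diagnosis afterwards:
EXPLAIN ANALYZE
SELECT AVG(salary) FROM employees WHERE department_id = 5;
```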
Run EXPLAIN ANALYZE and search for 'loops' > 1 on SubPlan nodes. If a SubPlan has loops=N where N is large, that's your bottleneck. Transform that specific subquery to a join or derived table pre-computation.
Subqueries are powerful but require understanding of their execution model to use efficiently. The key optimization principles to consolidate:

- Determine whether a subquery is correlated; a correlated subquery may execute once per outer row, turning milliseconds into hours.
- Let the optimizer unnest when it can, and confirm with EXPLAIN: look for semi-joins, anti-joins, or SubPlan nodes with high loop counts.
- Prefer NOT EXISTS over NOT IN whenever the subquery column can contain NULLs.
- Transform correlated aggregates into pre-aggregated derived tables, joins, or window functions.
- Keep subqueries where they genuinely win: EXISTS can stop at the first match and avoid cardinality explosion.
- Verify every rewrite returns identical results and measure with EXPLAIN ANALYZE rather than assuming.
What's next:
The final page of this module explores expression optimization—how the way you write SQL expressions affects execution. You'll learn about sargability, function optimization, implicit conversions, and expression simplification for maximum performance.
You now understand subquery execution models, can diagnose correlated subquery performance problems, and know how to transform expensive subqueries into efficient joins or window functions. These skills let you write complex logic without sacrificing performance.