Subqueries are one of SQL's most powerful features—they let you nest one query inside another, composing complex logic from simple building blocks. But this composability comes with a hidden danger: subqueries can be the source of some of the most severe performance problems in database systems.
A subquery that works instantly for 100 rows might grind to a halt at 100,000 rows. The difference is rarely about the subquery itself—it's about how the database executes it, and whether that execution strategy scales with your data. Understanding subquery optimization transforms mysterious slow queries into predictable, efficient operations.
By the end of this page, you'll understand the critical difference between correlated and non-correlated subqueries, master strategies for transforming expensive subqueries, learn when subqueries outperform joins, and develop intuition for recognizing subquery performance traps.
The way a subquery is executed depends fundamentally on whether it references values from the outer query. This distinction—correlated vs. non-correlated—determines whether the subquery runs once or potentially millions of times.
Non-Correlated Subqueries (Run Once):
A non-correlated subquery is independent of the outer query. It produces a fixed result that the outer query uses like a constant or a lookup table.
```sql
-- Non-correlated scalar subquery
SELECT employee_name, salary
FROM employees
WHERE salary > (SELECT AVG(salary) FROM employees);

-- Execution flow:
-- 1. Execute subquery ONCE: AVG(salary) = $75,000
-- 2. Substitute result: WHERE salary > 75000
-- 3. Execute outer query with constant filter
-- Total subquery executions: 1

-- Non-correlated IN subquery
SELECT product_name
FROM products
WHERE category_id IN (
    SELECT id FROM categories WHERE name LIKE 'Electronics%'
);

-- Execution flow:
-- 1. Execute subquery ONCE: Returns [5, 12, 23]
-- 2. Substitute: WHERE category_id IN (5, 12, 23)
-- 3. Execute as index lookup or hash semi-join
-- Total subquery executions: 1
```

Correlated Subqueries (Run Per Row):
A correlated subquery references columns from the outer query. It must be re-evaluated for each row the outer query considers.
```sql
-- Correlated scalar subquery
SELECT e1.employee_name, e1.salary, e1.department_id
FROM employees e1
WHERE e1.salary > (
    SELECT AVG(e2.salary)
    FROM employees e2
    WHERE e2.department_id = e1.department_id  -- Correlation!
);

-- Execution flow (naive):
-- For EACH employee in e1:
--   1. Execute subquery with e1.department_id = 'Sales' → $80K
--   2. Compare e1.salary > $80K
--   3. Move to next employee
--   4. Execute subquery with e1.department_id = 'Engineering' → $95K
--   5. Compare...
-- Total subquery executions: Same as number of employees!

-- With 100,000 employees: 100,000 subquery executions
```

| Aspect | Non-Correlated | Correlated (Naive) |
|---|---|---|
| Subquery executions | 1 | N (rows in outer query) |
| With 100K outer rows | 1 execution | 100,000 executions |
| With 1M outer rows | 1 execution | 1,000,000 executions |
| Time complexity | O(subquery) + O(outer) | O(outer × subquery) |
| Impact of slow subquery | Constant overhead | Multiplied by outer rows |
Correlated subqueries can be orders of magnitude slower than they appear. A subquery that executes in 10ms becomes a roughly 28-hour query when correlated across 10 million rows (10,000,000 × 10ms ≈ 100,000 seconds). Always examine whether your subquery correlates to the outer query.
Modern query optimizers apply sophisticated transformations to improve subquery execution. Understanding these transformations helps you write subqueries the optimizer can improve—or rewrite them yourself when it can't.
Subquery Flattening (Unnesting):
The optimizer converts subqueries into joins when possible, eliminating repeated execution:
```sql
-- Original: Correlated subquery
SELECT e.name
FROM employees e
WHERE e.salary > (
    SELECT AVG(e2.salary)
    FROM employees e2
    WHERE e2.department_id = e.department_id
);

-- Optimizer transforms to (conceptually):
SELECT e.name
FROM employees e
JOIN (
    SELECT department_id, AVG(salary) as avg_salary
    FROM employees
    GROUP BY department_id
) dept_avgs ON dept_avgs.department_id = e.department_id
WHERE e.salary > dept_avgs.avg_salary;

-- The subquery now executes ONCE (grouped), then joins
-- Instead of N subqueries, now: 1 aggregate + 1 join
```

Semi-Join Conversion:
IN and EXISTS subqueries are often converted to semi-joins:
```sql
-- Original: IN subquery
SELECT c.name
FROM customers c
WHERE c.id IN (SELECT customer_id FROM orders WHERE total > 1000);

-- Optimizer may convert to semi-join:
SELECT c.name
FROM customers c
SEMI JOIN orders o ON o.customer_id = c.id AND o.total > 1000;
-- (SEMI JOIN is internal - returns c rows for which a match exists)

-- Semi-join advantage:
-- - Stops at first match (doesn't return duplicates)
-- - Can use hash or nested loop strategies
-- - More optimization options than correlated subquery
```

Anti-Join Conversion:
NOT IN and NOT EXISTS convert to anti-joins:
```sql
-- Original: NOT EXISTS subquery
SELECT c.name
FROM customers c
WHERE NOT EXISTS (
    SELECT 1 FROM orders o WHERE o.customer_id = c.id
);

-- Optimizer converts to anti-join:
SELECT c.name
FROM customers c
ANTI JOIN orders o ON o.customer_id = c.id;
-- (ANTI JOIN returns c rows with NO match in orders)

-- Anti-join strategies:
-- - Hash anti-join: Build hash of orders, probe with customers
-- - Merge anti-join: Sorted merge, return non-matches
-- - Nested loop anti-join: For each customer, seek in orders index
```

Use EXPLAIN to see if the optimizer transformed your subquery. Look for 'Semi Join', 'Anti Join', or the subquery appearing as a derived table join. If you see 'SubPlan' or 'InitPlan' with loops = N, the subquery is executing repeatedly.
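As a quick illustration of this tip, here is a minimal sketch in PostgreSQL syntax; the plan fragment in the comments shows the shape to look for, not exact output:

```sql
-- Check whether the NOT EXISTS above was unnested (PostgreSQL)
EXPLAIN
SELECT c.name
FROM customers c
WHERE NOT EXISTS (
    SELECT 1 FROM orders o WHERE o.customer_id = c.id
);

-- A plan shaped like this means the subquery became an anti-join:
--   Hash Anti Join
--     Hash Cond: (c.id = o.customer_id)
--     ->  Seq Scan on customers c
--     ->  Hash
--           ->  Seq Scan on orders o
--
-- A SubPlan node with loops > 1 would instead mean per-row re-execution.
```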
Two common patterns for testing set membership—EXISTS and IN—have nuanced performance characteristics. Understanding when each excels helps you choose the optimal pattern.
EXISTS Behavior:
EXISTS returns true/false based on whether the subquery returns any rows. It's inherently efficient for existence checks:
```sql
-- EXISTS: Stops at first match
SELECT c.name
FROM customers c
WHERE EXISTS (
    SELECT 1 FROM orders o
    WHERE o.customer_id = c.id
      AND o.status = 'delivered'
);

-- For a customer with 1000 orders:
-- EXISTS: Checks orders until first 'delivered' found, stops
-- If first order matches, checks only 1 order

-- Index optimization:
-- With index on orders(customer_id, status):
-- EXISTS performs a single index seek per customer
```

IN Behavior:
IN compares against a set of values. Performance depends on subquery result size and optimization strategy:
```sql
-- IN: Works with the result set
SELECT c.name
FROM customers c
WHERE c.id IN (
    SELECT customer_id FROM orders WHERE status = 'delivered'
);

-- Execution strategies:
-- 1. Materialize subquery, then hash semi-join
--    - Subquery runs once, builds hash table
--    - Probe with each customer
--    - Good for large subquery results

-- 2. Convert to correlated EXISTS (some optimizers)
--    - For each customer, check if in orders
--    - Good for small outer table

-- 3. Nested loop semi-join with index
--    - For each customer, index seek in subquery result
--    - Good when outer table is small
```

When to Use Each:

| Situation | Prefer |
|---|---|
| Correlated existence check, especially with a supporting index | EXISTS |
| Small or moderate non-correlated value set | IN |
| Anti-join where the subquery column may contain NULLs | NOT EXISTS (see below) |
Critical: IN with NULLs:
```sql
-- The NULL trap with NOT IN
SELECT c.name FROM customers c
WHERE c.id NOT IN (SELECT customer_id FROM orders);

-- If ANY customer_id in orders is NULL:
-- NOT IN evaluates as: id <> 1 AND id <> 2 AND id <> NULL ...
-- Anything compared to NULL is UNKNOWN
-- UNKNOWN AND TRUE = UNKNOWN
-- WHERE UNKNOWN = No rows returned!

-- Solution 1: Filter NULLs
SELECT c.name FROM customers c
WHERE c.id NOT IN (
    SELECT customer_id FROM orders WHERE customer_id IS NOT NULL
);

-- Solution 2: Use NOT EXISTS (NULL-safe)
SELECT c.name FROM customers c
WHERE NOT EXISTS (
    SELECT 1 FROM orders o WHERE o.customer_id = c.id
);
-- NULL in orders.customer_id never equals c.id, so it works correctly
```

NOT IN (subquery) returns NO ROWS if the subquery contains ANY NULL value. This is a frequent source of bugs. Use NOT EXISTS for NULL-safe anti-join semantics, or explicitly exclude NULLs from the subquery.
Scalar subqueries return a single value and appear in SELECT lists, WHERE clauses, or expressions. Their optimization is critical because correlated scalar subqueries can execute once per output row.
Scalar Subquery in SELECT:
```sql
-- Problem: Correlated scalar subqueries in SELECT
SELECT c.name, c.id,
    (SELECT COUNT(*) FROM orders o WHERE o.customer_id = c.id) as order_count,
    (SELECT MAX(o.total) FROM orders o WHERE o.customer_id = c.id) as max_order,
    (SELECT SUM(o.total) FROM orders o WHERE o.customer_id = c.id) as total_spent
FROM customers c;

-- For 100,000 customers: 300,000 subquery executions!
-- Each subquery scans orders table (or uses index 100K times)

-- Solution: Aggregate once, join
SELECT c.name, c.id,
    COALESCE(stats.order_count, 0) as order_count,
    stats.max_order,
    COALESCE(stats.total_spent, 0) as total_spent
FROM customers c
LEFT JOIN (
    SELECT customer_id,
        COUNT(*) as order_count,
        MAX(total) as max_order,
        SUM(total) as total_spent
    FROM orders
    GROUP BY customer_id
) stats ON stats.customer_id = c.id;

-- One aggregation pass over orders, then one hash join
-- Complexity: O(orders) + O(customers) instead of O(customers × orders)
```

Scalar Subquery in WHERE:
```sql
-- Non-correlated: Runs once (efficient)
SELECT name FROM products
WHERE price > (SELECT AVG(price) FROM products);
-- Subquery result is constant, optimizer executes once

-- Correlated: Runs per row (expensive)
SELECT e.name FROM employees e
WHERE e.salary > (
    SELECT AVG(salary) FROM employees e2
    WHERE e2.department_id = e.department_id
);

-- Transformation to join:
SELECT e.name
FROM employees e
JOIN (
    SELECT department_id, AVG(salary) as avg_sal
    FROM employees
    GROUP BY department_id
) dept_avg ON dept_avg.department_id = e.department_id
WHERE e.salary > dept_avg.avg_sal;
```

Multiple Scalar Subqueries Consolidation:
When multiple scalar subqueries reference the same table, consolidate them:
```sql
-- Before: 4 separate subqueries
SELECT
    product_id,
    (SELECT name FROM products p WHERE p.id = oi.product_id) as name,
    (SELECT price FROM products p WHERE p.id = oi.product_id) as price,
    (SELECT category FROM products p WHERE p.id = oi.product_id) as category,
    (SELECT manufacturer FROM products p WHERE p.id = oi.product_id) as mfr
FROM order_items oi;

-- After: Single join
SELECT
    oi.product_id,
    p.name,
    p.price,
    p.category,
    p.manufacturer
FROM order_items oi
JOIN products p ON p.id = oi.product_id;

-- Even better with CTE for clarity:
WITH product_details AS (
    SELECT id, name, price, category, manufacturer
    FROM products
)
SELECT oi.product_id, pd.name, pd.price, pd.category, pd.manufacturer
FROM order_items oi
JOIN product_details pd ON pd.id = oi.product_id;
```

Some optimizers cache scalar subquery results for repeated correlation values. PostgreSQL's 'SubPlan' shows 'Loops: N' but may cache results for repeated correlation values (such as the same department_id). However, don't rely on this—explicit transformation is more reliable.
Derived tables (subqueries in FROM clause) create temporary result sets. Their optimization depends on when they're materialized versus inlined.
Materialization vs. Merge:
```sql
-- Simple derived table (often merged/inlined)
SELECT *
FROM (SELECT id, name FROM customers WHERE status = 'active') c
WHERE c.name LIKE 'A%';

-- Optimizer typically merges into:
SELECT id, name FROM customers
WHERE status = 'active' AND name LIKE 'A%';
-- Both filters applied together, single index scan if available

-- Complex derived table (typically materialized)
SELECT dt.category, dt.avg_price
FROM (
    SELECT category, AVG(price) as avg_price
    FROM products
    GROUP BY category
    HAVING AVG(price) > 100
) dt
WHERE dt.category IN ('Electronics', 'Books');

-- Structure usually forces materialization:
-- 1. Execute the aggregation (HAVING depends on the full group averages)
-- 2. Materialize the grouped result with categories and averages
-- 3. Then filter for specific categories
-- (Some optimizers can still push the category filter below the GROUP BY,
--  since it references only a grouping column)
```

Optimization Blockers:
Certain patterns prevent derived table optimization:
| Pattern | Why It Blocks Optimization | Alternative |
|---|---|---|
| DISTINCT in derived table | Must deduplicate before join | Move DISTINCT to final query if possible |
| GROUP BY in derived table | Must aggregate before join | Consider window functions or join first |
| TOP/LIMIT in derived table | Must materialize partial result | Apply LIMIT to final query when possible |
| UNION in derived table | Must combine sets before join | Restructure as separate queries with UNION ALL |
| Aggregate without correlation | Compute whole aggregate first | Use correlated aggregate or window function |
Strategic Derived Table Use:
Derived tables can help optimization when used strategically:
```sql
-- Good use: Pre-filter before expensive join
SELECT c.name, recent_orders.*
FROM customers c
JOIN (
    SELECT * FROM orders
    WHERE order_date > CURRENT_DATE - 30
) recent_orders ON recent_orders.customer_id = c.id;
-- Encourages the optimizer to reduce orders to the last 30 days before the join

-- Good use: Reduce join cardinality
SELECT c.name, top_order.total
FROM customers c
JOIN (
    SELECT DISTINCT ON (customer_id) customer_id, total
    FROM orders
    ORDER BY customer_id, total DESC
) top_order ON top_order.customer_id = c.id;
-- Each customer joins to exactly one row

-- Good use: Encapsulate complex logic
SELECT *
FROM (
    SELECT
        customer_id,
        SUM(total) as lifetime_value,
        NTILE(4) OVER (ORDER BY SUM(total) DESC) as quartile
    FROM orders
    GROUP BY customer_id
) customer_tiers
WHERE quartile = 1;  -- Top 25% of customers
```

CTEs (WITH clause) and derived tables often produce identical plans. However, some databases (older PostgreSQL, SQLite) materialize CTEs by default, while derived tables are more likely to be inlined. Check EXPLAIN for your specific database version.
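On PostgreSQL 12 and later you can steer this decision explicitly; a brief sketch reusing the recent-orders example above (the MATERIALIZED / NOT MATERIALIZED keywords are PostgreSQL-specific):

```sql
-- NOT MATERIALIZED lets the CTE be inlined like a derived table;
-- MATERIALIZED forces a one-time computation of the CTE result.
WITH recent_orders AS NOT MATERIALIZED (
    SELECT customer_id, total
    FROM orders
    WHERE order_date > CURRENT_DATE - 30
)
SELECT c.name, r.total
FROM customers c
JOIN recent_orders r ON r.customer_id = c.id;
-- Older PostgreSQL versions always materialize CTEs, so confirm with EXPLAIN.
```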
When the optimizer can't transform a subquery efficiently, manual transformation to joins often provides dramatic speedups.
Pattern 1: Correlated Aggregate to JOIN:
```sql
-- Before: Correlated subquery (per-row execution)
SELECT p.name, p.price,
    p.price - (SELECT AVG(p2.price)
               FROM products p2
               WHERE p2.category_id = p.category_id) as diff_from_avg
FROM products p;

-- After: Pre-aggregate and join
SELECT p.name, p.price,
    p.price - cat_avg.avg_price as diff_from_avg
FROM products p
JOIN (
    SELECT category_id, AVG(price) as avg_price
    FROM products
    GROUP BY category_id
) cat_avg ON cat_avg.category_id = p.category_id;

-- Or using window function (often most elegant):
SELECT name, price,
    price - AVG(price) OVER (PARTITION BY category_id) as diff_from_avg
FROM products;
```

Pattern 2: EXISTS to JOIN:
```sql
-- Before: EXISTS subquery
SELECT DISTINCT c.id, c.name
FROM customers c
WHERE EXISTS (
    SELECT 1 FROM orders o
    WHERE o.customer_id = c.id AND o.total > 1000
);

-- After: JOIN with DISTINCT
SELECT DISTINCT c.id, c.name
FROM customers c
JOIN orders o ON o.customer_id = c.id
WHERE o.total > 1000;

-- Or semi-join syntax where supported (cleaner semantics):
SELECT c.id, c.name
FROM customers c
WHERE c.id IN (SELECT customer_id FROM orders WHERE total > 1000);
-- Modern optimizers convert this to semi-join automatically
```

Pattern 3: NOT EXISTS to LEFT JOIN + IS NULL:
```sql
-- Before: NOT EXISTS subquery
SELECT c.id, c.name
FROM customers c
WHERE NOT EXISTS (
    SELECT 1 FROM orders o WHERE o.customer_id = c.id
);

-- After: LEFT JOIN + IS NULL check
SELECT c.id, c.name
FROM customers c
LEFT JOIN orders o ON o.customer_id = c.id
WHERE o.customer_id IS NULL;

-- Which is faster depends on:
-- - Optimizer capabilities (modern ones: similar performance)
-- - Available indexes
-- - Table sizes and selectivity

-- Test both with EXPLAIN ANALYZE in your environment
```

Pattern 4: Correlated TOP-N to Window Function:
```sql
-- Before: Correlated subquery for "latest order per customer"
SELECT c.name, o.order_id, o.total
FROM customers c
JOIN orders o ON o.customer_id = c.id
WHERE o.order_date = (
    SELECT MAX(order_date)
    FROM orders o2
    WHERE o2.customer_id = c.id
);
-- Subquery runs for each order row!

-- After: Window function (single pass)
SELECT c.name, o.order_id, o.total
FROM customers c
JOIN (
    SELECT order_id, customer_id, total, order_date,
        ROW_NUMBER() OVER (
            PARTITION BY customer_id
            ORDER BY order_date DESC
        ) as rn
    FROM orders
) o ON o.customer_id = c.id AND o.rn = 1;

-- Even cleaner with CTE:
WITH latest_orders AS (
    SELECT *,
        ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date DESC) as rn
    FROM orders
)
SELECT c.name, lo.order_id, lo.total
FROM customers c
JOIN latest_orders lo ON lo.customer_id = c.id AND lo.rn = 1;
```

Always verify transformations produce identical results with test data. Subtle semantic differences (NULLs, duplicates, edge cases) can cause transformations to produce different results. Compare COUNT(*) and spot-check sample rows.
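As one way to do that, here is a sketch that diffs the two forms from Pattern 3 with EXCEPT (run it again with the operands swapped to check both directions):

```sql
-- Rows returned by the NOT EXISTS form but not by the LEFT JOIN form
-- (an empty result in both directions suggests the rewrite is equivalent)
SELECT c.id, c.name
FROM customers c
WHERE NOT EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.id)
EXCEPT
SELECT c.id, c.name
FROM customers c
LEFT JOIN orders o ON o.customer_id = c.id
WHERE o.customer_id IS NULL;
```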
Despite general guidance to prefer joins, subqueries sometimes outperform their join equivalents. Understanding these cases prevents unnecessary refactoring.
Case 1: Early Filtering with EXISTS:
```sql
-- Scenario: Find customers who ordered product X
-- Customer 12345 has 50,000 orders, product X is in order #5

-- JOIN approach: Scans all matching orders
SELECT DISTINCT c.name
FROM customers c
JOIN orders o ON o.customer_id = c.id
JOIN order_items oi ON oi.order_id = o.order_id
WHERE oi.product_id = 'X' AND c.id = 12345;
-- Joins all 50,000 orders, then filters to product X

-- EXISTS approach: Stops at first match
SELECT c.name
FROM customers c
WHERE c.id = 12345
  AND EXISTS (
    SELECT 1
    FROM orders o
    JOIN order_items oi ON oi.order_id = o.order_id
    WHERE o.customer_id = c.id AND oi.product_id = 'X'
);
-- Finds product X in order #5, immediately returns
```

Case 2: Avoiding Cardinality Explosion:
```sql
-- Scenario: Count customers who have any order with items

-- JOIN approach: Cardinality explosion
SELECT COUNT(DISTINCT c.id)
FROM customers c
JOIN orders o ON o.customer_id = c.id
JOIN order_items oi ON oi.order_id = o.order_id;
-- If average customer has 100 orders with 10 items each:
-- 10,000 customers × 100 orders × 10 items = 10 million intermediate rows
-- Then COUNT DISTINCT de-duplicates expensively

-- EXISTS approach: No explosion
SELECT COUNT(*)
FROM customers c
WHERE EXISTS (
    SELECT 1
    FROM orders o
    JOIN order_items oi ON oi.order_id = o.order_id
    WHERE o.customer_id = c.id
);
-- 10,000 customers × 1 existence check each = 10,000 checks
-- Each check stops at first match
```

Case 3: Scalar Subquery for Single Value:
```sql
-- Scenario: Show each product with the global average price

-- JOIN approach: Compute Cartesian product
SELECT p.name, p.price, avg_data.global_avg
FROM products p
CROSS JOIN (SELECT AVG(price) as global_avg FROM products) avg_data;
-- Semantic: Cross join suggests Cartesian, even though result is 1 row

-- Scalar subquery: Cleaner, equally efficient
SELECT p.name, p.price,
    (SELECT AVG(price) FROM products) as global_avg
FROM products p;
-- The subquery executes once, result used as constant
-- Cleaner intent: "compute this one value"

-- Both produce identical plans in most optimizers
```

Modern optimizers often transform subqueries and joins into equivalent plans. The 'subquery is slow' advice originated when optimizers were less sophisticated. Always measure with EXPLAIN ANALYZE rather than assuming one form is faster.
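To follow that advice for the example above, a minimal sketch that measures both forms side by side (timings and plan shapes will vary by database and data size):

```sql
-- Compare the 'Execution Time' reported at the bottom of each plan
EXPLAIN ANALYZE
SELECT p.name, p.price,
    (SELECT AVG(price) FROM products) as global_avg
FROM products p;

EXPLAIN ANALYZE
SELECT p.name, p.price, avg_data.global_avg
FROM products p
CROSS JOIN (SELECT AVG(price) as global_avg FROM products) avg_data;
-- If the plans match (e.g. the average computed once as an InitPlan or
-- a one-row join input), the choice is purely about readability.
```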
When subqueries cause performance problems, systematic diagnosis reveals the root cause.
Step 1: Check Execution Count
```sql
-- PostgreSQL EXPLAIN ANALYZE reveals loop count
EXPLAIN ANALYZE
SELECT e.name
FROM employees e
WHERE e.salary > (
    SELECT AVG(salary) FROM employees e2
    WHERE e2.department_id = e.department_id
);

-- Look for "SubPlan" or "InitPlan" sections:
--   SubPlan 1 (returns $2)
--     Aggregate (actual time=0.1..0.1 rows=1 loops=10000)  <-- 10000 executions!
--       Seq Scan on employees e2

-- For non-correlated subquery, you'd see loops=1

-- MySQL EXPLAIN shows SELECT type:
-- DEPENDENT SUBQUERY = correlated (bad if outer table large)
-- SUBQUERY = non-correlated (good, runs once)
```

Step 2: Identify the Correlation:
```sql
-- The correlation is the outer reference in the subquery
SELECT o.order_id
FROM orders o
WHERE o.total > (
    SELECT AVG(o2.total)
    FROM orders o2
    WHERE o2.customer_id = o.customer_id  -- This is the correlation
      AND o2.order_date > o.order_date    -- Another correlation
);

-- Each outer reference creates a dependency:
-- The subquery depends on o.customer_id and o.order_date from the outer query
-- These values change for each row, forcing re-execution

-- Question to ask: "Can I pre-compute this once for all values?"
-- If yes → transform to derived table/CTE with JOIN
-- If no (depends on row-specific context) → ensure efficient index
```

Step 3: Measure Subquery in Isolation:
```sql
-- Extract and test the subquery with sample correlation values
-- Original query has: WHERE e2.department_id = e.department_id

-- Test with actual department values:
EXPLAIN ANALYZE
SELECT AVG(salary) FROM employees WHERE department_id = 5;
-- Time: 0.5 ms - fast individually

EXPLAIN ANALYZE
SELECT AVG(salary) FROM employees WHERE department_id = 12;
-- Time: 0.5 ms - consistent

-- If 10,000 employees: 10,000 × 0.5ms = 5 seconds
-- Solution: Pre-aggregate all departments once

SELECT department_id, AVG(salary) as avg_sal
FROM employees
GROUP BY department_id;
-- Time: 50 ms for ALL departments at once
-- Then join: O(n) instead of O(n²)
```

| Check | What to Look For | Action if Problematic |
|---|---|---|
| Execution count (loops) | loops > 1 for same SubPlan | Transform to derived table + JOIN |
| Correlation columns | Outer references in subquery | Pre-compute or ensure index exists |
| Missing index | Seq Scan in subquery on large table | Add index matching WHERE clause |
| Large materialization | 'Materialize' with many rows | Add filtering conditions inside subquery |
| Sort in subquery | Sort operation for MAX/MIN/ORDER BY | Add ordered index to avoid sort |
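For the 'Missing index' row above, a minimal sketch of the fix using the employees example from Step 1 (the index name is illustrative):

```sql
-- A composite index matching the correlated WHERE clause lets each loop
-- become an index (often index-only) scan instead of a sequential scan.
CREATE INDEX idx_employees_dept_salary
    ON employees (department_id, salary);

-- Re-run the diagnosis afterwards:
EXPLAIN ANALYZE
SELECT AVG(salary) FROM employees WHERE department_id = 5;
```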
Run EXPLAIN ANALYZE and search for 'loops' > 1 on SubPlan nodes. If a SubPlan has loops=N where N is large, that's your bottleneck. Transform that specific subquery to a join or derived table pre-computation.
Subqueries are powerful but require understanding of their execution model to use efficiently. The key optimization principles to consolidate:

- Determine whether a subquery is correlated; a correlated subquery may execute once per outer row, turning milliseconds into hours.
- Let the optimizer unnest when it can, and confirm with EXPLAIN: look for semi-joins, anti-joins, or SubPlan nodes with high loop counts.
- Prefer NOT EXISTS over NOT IN whenever the subquery column can contain NULLs.
- Transform correlated aggregates into pre-aggregated derived tables, joins, or window functions.
- Keep subqueries where they genuinely win: EXISTS can stop at the first match and avoid cardinality explosion.
- Verify every rewrite returns identical results and measure with EXPLAIN ANALYZE rather than assuming.
What's next:
The final page of this module explores expression optimization—how the way you write SQL expressions affects execution. You'll learn about sargability, function optimization, implicit conversions, and expression simplification for maximum performance.
You now understand subquery execution models, can diagnose correlated subquery performance problems, and know how to transform expensive subqueries into efficient joins or window functions. These skills let you write complex logic without sacrificing performance.