Correlated Subqueries

Level: Intermediate

Duration: 60 mins

Topic: SQL Joins & Subqueries

Performance Considerations

When Correlated Subqueries Become Costly

Correlated subqueries carry a reputation for poor performance, sometimes deserved, often not. The conceptual model of "execute the subquery once per outer row" conjures images of runaway N × M slowdowns. But reality is more nuanced.

Modern database optimizers are remarkably sophisticated at transforming, decorrelating, and optimizing subqueries. A correlated subquery that looks expensive may execute as efficiently as a carefully crafted join. Conversely, a seemingly innocent correlated query may indeed cause performance disasters when:

•The optimizer cannot apply transformations
•Missing indexes force repeated full table scans
•The query processes millions of rows
•Complex joins inside the subquery resist optimization

This page equips you to understand, analyze, and optimize correlated subquery performance—whether writing new queries or diagnosing production issues.

What You Will Learn

By the end of this page, you will understand when correlated subqueries cause performance problems, how to analyze execution plans, indexing strategies for optimization, techniques for query rewriting, and when to accept the performance characteristics of correlated approaches.

Understanding the Performance Model

To reason about correlated subquery performance, you need to understand both the theoretical model and how optimizers modify it.

Theoretical Complexity:

For a correlated subquery without optimization:

•Outer query processes N rows
•For each outer row, subquery processes M rows (potentially filtered)
•Total work: approximately N × M operations

If N = 100,000 and M = 100,000 (no filtering), you're looking at 10 billion operations—clearly problematic.

What Reduces Actual Cost:

Performance Optimizations

•Decorrelation — Optimizer transforms correlated subquery into a join, computing aggregates once rather than per-row. This is the most impactful optimization.
•Index utilization — Indexes on correlation columns reduce per-execution cost from O(M) to O(log M) or O(1). For a 1M-row inner table, a B-tree lookup needs roughly 20 comparisons instead of scanning 1,000,000 rows.
•Result caching — If correlation values repeat, engines may cache subquery results. 1M rows with 10 distinct departments = 10 actual subquery executions.
•Early termination — EXISTS stops at first match. If matches are common, execution is nearly instant per outer row.
•Predicate pushdown — Optimizer pushes outer query filters down, reducing N before correlation begins.
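To make the gap between these strategies concrete, here is a minimal back-of-envelope sketch. The function name `estimated_operations` and the three cost formulas are illustrative assumptions rather than anything a real optimizer computes; they simply restate the N × M model, a per-row index lookup, and a single-pass decorrelated plan.

```python
import math


def estimated_operations(n_outer: int, m_inner: int) -> dict:
    """Rough operation counts for three execution strategies.

    These are back-of-envelope estimates, not measurements.
    """
    return {
        # Naive model: full scan of the inner table for every outer row.
        "naive_scan": n_outer * m_inner,
        # One B-tree lookup (about log2(M) comparisons) per outer row.
        "indexed_lookup": n_outer * math.ceil(math.log2(m_inner)),
        # Decorrelated plan: one pass over each input, hash-join style.
        "decorrelated": n_outer + m_inner,
    }


costs = estimated_operations(100_000, 100_000)
for strategy, ops in costs.items():
    print(f"{strategy:>15}: {ops:,}")
```

With N = M = 100,000 the naive figure reproduces the 10 billion operations quoted above, while the other two strategies land several orders of magnitude lower.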

complexity_scenarios.sql
-- SCENARIO 1: Small, well-indexed
-- Outer: 1,000 customers
-- Inner: Orders indexed on customer_id
-- Each subquery: O(log n) index lookup
-- Total: ~1,000 × log(orders) = very fast
 
SELECT c.name FROM customers c
WHERE EXISTS (
    SELECT 1 FROM orders o 
    WHERE o.customer_id = c.customer_id  -- Indexed!
);
 
 
-- SCENARIO 2: Large, no index (DANGER)
-- Outer: 100,000 products
-- Inner: 1,000,000 reviews, no index on product_id
-- Each subquery: Full scan of 1M rows
-- Total: 100,000 × 1,000,000 = 100 billion operations
 
SELECT p.name FROM products p
WHERE p.price > (
    SELECT AVG(r.rating) FROM reviews r  -- No index!
    WHERE r.product_id = p.product_id
);
 
 
-- SCENARIO 3: Decorrelated by optimizer
-- Optimizer recognizes pattern and transforms to:
SELECT p.name FROM products p
JOIN (
    SELECT product_id, AVG(rating) as avg_rating
    FROM reviews
    GROUP BY product_id
) r ON r.product_id = p.product_id
WHERE p.price > r.avg_rating;
-- Now: Single scan of reviews + hash join = O(n + m)
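The claim that Scenario 3 is a safe rewrite of Scenario 2 can be checked on a toy dataset. The sketch below uses Python's built-in sqlite3 module as a stand-in engine (an assumption made for convenience; the page's examples target PostgreSQL), and the three products and their ratings are invented purely for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products (product_id INTEGER PRIMARY KEY, name TEXT, price REAL);
    CREATE TABLE reviews  (review_id INTEGER PRIMARY KEY, product_id INTEGER, rating REAL);
    INSERT INTO products VALUES (1, 'kettle', 4.5), (2, 'toaster', 2.0), (3, 'mug', 9.0);
    -- Product 3 has no reviews: its AVG is NULL, so both queries drop it.
    INSERT INTO reviews (product_id, rating) VALUES (1, 3.0), (1, 5.0), (2, 4.0);
""")

correlated = """
    SELECT p.name FROM products p
    WHERE p.price > (
        SELECT AVG(r.rating) FROM reviews r
        WHERE r.product_id = p.product_id
    )
    ORDER BY p.name
"""

decorrelated = """
    SELECT p.name FROM products p
    JOIN (
        SELECT product_id, AVG(rating) AS avg_rating
        FROM reviews
        GROUP BY product_id
    ) r ON r.product_id = p.product_id
    WHERE p.price > r.avg_rating
    ORDER BY p.name
"""

correlated_rows = conn.execute(correlated).fetchall()
decorrelated_rows = conn.execute(decorrelated).fetchall()
print(correlated_rows, decorrelated_rows)
```

The unreviewed product is the case worth watching: the inner join drops it, and the correlated form drops it too because a comparison against NULL is never true. A rewrite that must keep such rows would need a LEFT JOIN instead.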

Analyzing Execution Plans for Correlated Subqueries

The execution plan reveals how the database actually runs your query. Learning to read plans for correlated subqueries is essential for performance work.

Key Things to Look For:

Execution Plan Indicators
Plan Element	Indicates	Performance Implication
SubPlan	Subquery executed as written, once per outer row	Potential N×M if not optimized
Hash Semi Join / Anti Join	EXISTS/NOT EXISTS optimized	Efficient—executed as join
Nested Loop + Index Lookup	Per-row with index	Acceptable for small outer sets
Hash Join / Merge Join	Decorrelated to join	Efficient—computed once
Seq Scan in subquery	No index being used	Red flag for large tables
loops=N (high number)	Subquery ran N times	Problem if N is large

explain_analysis.sql
-- PostgreSQL: Use EXPLAIN ANALYZE for actual execution stats
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT e.name, e.salary
FROM employees e
WHERE e.salary > (
    SELECT AVG(e2.salary)
    FROM employees e2
    WHERE e2.dept_id = e.dept_id
);
 
-- Sample output analysis:
/*
Seq Scan on employees e (cost=0.00..2501.00 rows=333 loops=1)
  Filter: (salary > (SubPlan 1))
  Rows Removed by Filter: 667
  SubPlan 1
    ->  Aggregate (cost=24.50..24.51 rows=1 loops=1000)
                                               ^^^^^^^^^
                                               RED FLAG: 1000 loops!
          ->  Seq Scan on employees e2
                Filter: (dept_id = e.dept_id)
                
⚠️ "loops=1000" means subquery executed 1000 times
⚠️ "Seq Scan on employees e2" means full table scan each time
*/
 
 
-- Compare with decorrelated version:
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT e.name, e.salary
FROM employees e
JOIN (
    SELECT dept_id, AVG(salary) as avg_sal
    FROM employees GROUP BY dept_id
) d ON d.dept_id = e.dept_id
WHERE e.salary > d.avg_sal;
 
/*
Hash Join (cost=...) (loops=1)
                       ^^^^^^^
                       Single execution!
  -> Seq Scan on employees
  -> Hash (Subquery Scan on employees)
      -> HashAggregate (GROUP BY dept_id)
*/

loops= is Your Friend

In PostgreSQL EXPLAIN ANALYZE output, watch the 'loops' count. loops=1 for operations inside a subquery means decorrelation succeeded. loops=10000 means trouble: the subquery ran 10,000 times. MySQL and SQL Server have equivalent indicators in their plan formats.
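The same plan-reading habit carries over to other engines. As a minimal sketch, the snippet below inspects plans with SQLite's `EXPLAIN QUERY PLAN` through Python's sqlite3 module. SQLite is swapped in here only because it runs without a server; its plan text differs from PostgreSQL's, and it reports no loops counter. The helper name `plan_details` is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees ("
    "emp_id INTEGER PRIMARY KEY, name TEXT, dept_id INTEGER, salary REAL)"
)

correlated = """
    SELECT e.name FROM employees e
    WHERE e.salary > (
        SELECT AVG(e2.salary) FROM employees e2
        WHERE e2.dept_id = e.dept_id
    )
"""

decorrelated = """
    SELECT e.name FROM employees e
    JOIN (
        SELECT dept_id, AVG(salary) AS avg_sal
        FROM employees GROUP BY dept_id
    ) d ON d.dept_id = e.dept_id
    WHERE e.salary > d.avg_sal
"""


def plan_details(sql):
    # The fourth column of each EXPLAIN QUERY PLAN row is the readable detail.
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


correlated_plan = plan_details(correlated)
decorrelated_plan = plan_details(decorrelated)

for label, plan in (("correlated", correlated_plan), ("decorrelated", decorrelated_plan)):
    print(label)
    for line in plan:
        print("   ", line)
```

In SQLite the per-row form is labelled as a correlated scalar subquery in the plan, which plays the role that SubPlan with a high loops count plays in PostgreSQL.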

Indexing Strategies for Correlated Subqueries

Proper indexing can transform a catastrophically slow correlated subquery into an efficient query. The key is indexing the correlation columns—the columns that link inner to outer query.

indexing_correlated.sql
-- THE QUERY:
SELECT c.customer_id, c.name
FROM customers c
WHERE EXISTS (
    SELECT 1 FROM orders o
    WHERE o.customer_id = c.customer_id  -- Correlation column
    AND o.order_date >= '2024-01-01'     -- Additional filter
);
 
-- REQUIRED INDEX:
-- Index on the correlation column(s) in the INNER table
CREATE INDEX idx_orders_customer_id ON orders(customer_id);
 
-- BETTER INDEX:
-- Covering index includes the filter column
CREATE INDEX idx_orders_customer_date ON orders(customer_id, order_date);
 
-- With this index:
-- For each customer, the EXISTS check uses index seek → O(log n)
-- Without index: Full table scan → O(n)
 
 
-- AGGREGATE CORRELATION:
SELECT e.name
FROM employees e
WHERE e.salary > (
    SELECT AVG(e2.salary)
    FROM employees e2
    WHERE e2.dept_id = e.dept_id  -- Correlation column
);
 
-- INDEX FOR AGGREGATE:
CREATE INDEX idx_employees_dept_id ON employees(dept_id);
 
-- Even better (covering index with salary for AVG):
CREATE INDEX idx_employees_dept_salary ON employees(dept_id, salary);
 
-- This lets the subquery compute AVG from index alone (index-only scan)
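One way to confirm that an index is actually picked up is to compare the plan before and after creating it. The sketch below does this in SQLite via Python's sqlite3 module, which is an assumption made for portability; PostgreSQL would report an Index Only Scan in its own format. The helper name `plan_text` is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees ("
    "emp_id INTEGER PRIMARY KEY, name TEXT, dept_id INTEGER, salary REAL)"
)

query = """
    SELECT e.name FROM employees e
    WHERE e.salary > (
        SELECT AVG(e2.salary) FROM employees e2
        WHERE e2.dept_id = e.dept_id  -- correlation column
    )
"""


def plan_text(sql):
    # Join the detail column of every plan row into one readable string.
    return "\n".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))


before = plan_text(query)

# Index the INNER table: correlation column first, aggregated column second.
conn.execute("CREATE INDEX idx_employees_dept_salary ON employees(dept_id, salary)")
after = plan_text(query)

print("Before index:\n" + before)
print("After index:\n" + after)
```

If the index name does not appear in the second plan, the engine has decided not to use it, and the query structure or statistics deserve a closer look.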

Indexing Checklist

•Identify correlation predicates — Find WHERE clauses in the subquery that reference outer query tables.
•Index inner table correlation columns — The columns FROM the inner table used in correlation need indexes (e.g., orders.customer_id, not customers.customer_id).
•Consider composite indexes — If subquery has additional filters (e.g., date ranges), include those in the index after the correlation column.
•Think about covering indexes — Include columns needed for aggregation (for AVG, include the value column) to enable index-only scans.
•Check execution plan after indexing — Verify the index is actually used; sometimes statistics or query structure prevent index usage.

Index the Inner Table

A common mistake is indexing the correlation column on the outer table instead of the inner table. For 'WHERE inner.fk = outer.pk', you need an index on inner.fk, not outer.pk. The inner table is what gets searched repeatedly.

Query Rewriting for Performance

When the optimizer doesn't decorrelate automatically, or when you need explicit control, manual query rewriting provides alternatives. These transformations preserve semantics while changing execution characteristics.

Manual Decorrelation to JOIN:

Compute the correlated values once, then join.

rewrite_to_join.sql
-- ORIGINAL (correlated):
SELECT p.product_id, p.name, p.price
FROM products p
WHERE p.price > (
    SELECT AVG(p2.price)
    FROM products p2
    WHERE p2.category_id = p.category_id
);
 
-- REWRITTEN (decorrelated join):
WITH category_avgs AS (
    SELECT category_id, AVG(price) as avg_price
    FROM products
    GROUP BY category_id
)
SELECT p.product_id, p.name, p.price
FROM products p
JOIN category_avgs ca ON ca.category_id = p.category_id
WHERE p.price > ca.avg_price;
 
-- Benefits:
-- • AVG computed once per category
-- • Single pass through products
-- • Clear, readable structure
-- • Often faster execution plan

Performance Anti-Patterns

Certain patterns consistently cause performance problems with correlated subqueries. Recognizing and avoiding these anti-patterns prevents common performance disasters.

anti_patterns.sql
-- ANTI-PATTERN 1: Multiple correlated subqueries computing similar things
-- ❌ BAD: Three separate subqueries, three passes
SELECT 
    c.name,
    (SELECT COUNT(*) FROM orders o WHERE o.customer_id = c.customer_id),
    (SELECT SUM(amount) FROM orders o WHERE o.customer_id = c.customer_id),
    (SELECT AVG(amount) FROM orders o WHERE o.customer_id = c.customer_id)
FROM customers c;
 
-- ✓ GOOD: Single aggregation
SELECT c.name, stats.cnt, stats.total, stats.avg_amount
FROM customers c
LEFT JOIN LATERAL (
    SELECT COUNT(*) cnt, SUM(amount) total, AVG(amount) avg_amount
    FROM orders o WHERE o.customer_id = c.customer_id
) stats ON true;
 
 
-- ANTI-PATTERN 2: Correlated subquery on unindexed column
-- ❌ BAD: Full scan for every outer row
SELECT p.product_name
FROM products p
WHERE p.price > (
    SELECT AVG(r.sentiment_score)  -- No index on product_id!
    FROM reviews r
    WHERE r.product_id = p.product_id
);
 
-- ✓ FIX: Add index
CREATE INDEX idx_reviews_product ON reviews(product_id);
-- Or rewrite to JOIN with pre-computed aggregates
 
 
-- ANTI-PATTERN 3: Correlated subquery with JOIN inside
-- ❌ BAD: Complex join executed per-row
SELECT c.name
FROM customers c
WHERE (
    SELECT SUM(oi.quantity * oi.price)
    FROM orders o
    JOIN order_items oi ON oi.order_id = o.order_id
    JOIN products p ON p.product_id = oi.product_id
    WHERE o.customer_id = c.customer_id
    AND p.category = 'Electronics'
) > 1000;
 
-- ✓ GOOD: Pre-compute the aggregation
WITH customer_electronics_spend AS (
    SELECT o.customer_id, SUM(oi.quantity * oi.price) as total
    FROM orders o
    JOIN order_items oi ON oi.order_id = o.order_id  
    JOIN products p ON p.product_id = oi.product_id
    WHERE p.category = 'Electronics'
    GROUP BY o.customer_id
)
SELECT c.name
FROM customers c
JOIN customer_electronics_spend ces ON ces.customer_id = c.customer_id
WHERE ces.total > 1000;
 
 
-- ANTI-PATTERN 4: Using a function in the correlation predicate
-- ❌ BAD: Function prevents index usage
SELECT c.name FROM customers c
WHERE EXISTS (
    SELECT 1 FROM orders o
    WHERE UPPER(o.customer_code) = UPPER(c.code)  -- No index use!
);
 
-- ✓ GOOD: Use consistent case in data, or functional index
CREATE INDEX idx_orders_customer_code_upper ON orders(UPPER(customer_code));
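For Anti-Pattern 1, engines without LATERAL support can usually reach the same result with a plain LEFT JOIN and GROUP BY. The sketch below is one such alternative, checked in SQLite (which has no LATERAL) through Python's sqlite3 module. The sample customers and orders are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace'), (3, 'Edsger');
    -- Customer 3 has no orders.
    INSERT INTO orders (customer_id, amount) VALUES (1, 10.0), (1, 30.0), (2, 5.0);
""")

three_subqueries = """
    SELECT c.name,
        (SELECT COUNT(*)    FROM orders o WHERE o.customer_id = c.customer_id),
        (SELECT SUM(amount) FROM orders o WHERE o.customer_id = c.customer_id),
        (SELECT AVG(amount) FROM orders o WHERE o.customer_id = c.customer_id)
    FROM customers c
    ORDER BY c.customer_id
"""

# COUNT(o.order_id) rather than COUNT(*), so customers without orders get 0.
single_pass = """
    SELECT c.name, COUNT(o.order_id), SUM(o.amount), AVG(o.amount)
    FROM customers c
    LEFT JOIN orders o ON o.customer_id = c.customer_id
    GROUP BY c.customer_id, c.name
    ORDER BY c.customer_id
"""

subquery_rows = conn.execute(three_subqueries).fetchall()
join_rows = conn.execute(single_pass).fetchall()
print(join_rows)
```

The detail that is easy to get wrong is the count: COUNT(*) over a LEFT JOIN would report 1 for a customer with no orders, because the unmatched customer still produces one row.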

The Biggest Anti-Pattern

The worst anti-pattern: writing correlated subqueries with complex JOINs on large, unindexed tables without checking execution plans. Always EXPLAIN ANALYZE before deploying correlated subqueries on production-scale data.

When Correlated Subqueries Are Actually Fine

Not all correlated subqueries need optimization. In many cases, they perform excellently without intervention. Understanding when to accept correlated performance prevents premature optimization.

Correlated Subqueries Are Often Fine When

•Outer query returns few rows — If WHERE clause on outer query is highly selective (returns 100 rows from 1M), per-row subquery execution is trivial.
•Subquery has indexed lookup — With index on correlation columns, each subquery execution is O(log n) or O(1). 1000 × O(log n) is fast.
•EXISTS with early termination — EXISTS stops at first match. If matches are common, most subquery executions are nearly instant.
•Optimizer successfully decorrelates — Check EXPLAIN. If you see Hash Join or Merge Join instead of SubPlan, the optimizer did the work for you.
•Correlation values have low cardinality — 1M rows but only 50 distinct departments means subquery results can be cached and reused, provided the engine supports subquery result caching.
•Query runs infrequently — Ad-hoc analytics or nightly batch jobs have different performance requirements than online queries.

acceptable_correlated.sql
-- ACCEPTABLE: EXISTS with good index, common matches
SELECT c.customer_id, c.name
FROM customers c
WHERE c.signup_date >= '2024-01-01'  -- Selective outer filter
AND EXISTS (
    SELECT 1 FROM orders o
    WHERE o.customer_id = c.customer_id  -- Indexed
);
-- New customers (small set) + indexed EXISTS = fast
 
 
-- ACCEPTABLE: Scalar subquery with cached results
SELECT e.name, e.salary,
    (SELECT d.dept_name FROM departments d 
     WHERE d.dept_id = e.dept_id) as department
FROM employees e;
-- 1000 employees, 10 departments = at most 10 distinct lookups if the engine caches results
 
 
-- ACCEPTABLE: Small outer set, any subquery complexity
SELECT c.name,
    (SELECT SUM(total) FROM orders WHERE customer_id = c.customer_id)
FROM customers c
WHERE c.customer_id IN (101, 102, 103);  -- Only 3 customers!
-- 3 subquery executions, complexity doesn't matter
 
 
-- CHECK BEFORE OPTIMIZING:
EXPLAIN ANALYZE
SELECT ...  -- your correlated query
 
-- If total execution time is acceptable (e.g., < 100ms for online,
-- < 1 minute for batch), optimization effort may not be worthwhile

Monitoring and Continuous Improvement

Performance of correlated subqueries can change over time as data grows or distributions shift. Continuous monitoring ensures queries remain performant.

Monitoring Best Practices

•Log slow queries — Configure slow query logging with a reasonable threshold. Correlated subqueries often appear here as data grows.
•Track query plan changes — Optimizer decisions change with statistics updates. A decorrelated query today might not be tomorrow.
•Monitor table growth — Subqueries that worked fine at 100K rows may struggle at 10M. Track table sizes.
•Test with production-like data — Development databases often have tiny datasets. Always test correlated subquery performance with realistic data volumes.
•Review after schema changes — Adding/removing indexes affects optimizer choices. Re-check correlated subquery plans after schema changes.
•Set up alerts — For critical queries, set up monitoring alerts for execution time degradation.

monitoring_queries.sql
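-- PostgreSQL: Find slow statements that likely contain subqueries
-- (the text match is a heuristic; requires the pg_stat_statements extension)
-- Column names are for PostgreSQL 13+; on 12 and earlier use mean_time / total_time
SELECT query, calls, mean_exec_time, total_exec_time
FROM pg_stat_statements
WHERE query ILIKE '%EXISTS%' OR query ILIKE '%SELECT%SELECT%'
ORDER BY mean_exec_time DESC
LIMIT 20;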
 
 
-- MySQL: Performance Schema for subquery analysis
SELECT DIGEST_TEXT, COUNT_STAR, AVG_TIMER_WAIT/1000000000 as avg_ms
FROM performance_schema.events_statements_summary_by_digest
WHERE DIGEST_TEXT LIKE '%EXISTS%'
ORDER BY AVG_TIMER_WAIT DESC
LIMIT 20;
 
 
-- Create a baseline for critical queries
-- Store execution plans and times, compare periodically
CREATE TABLE query_performance_log (
    query_id VARCHAR(100),
    captured_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    execution_time_ms NUMERIC,
    plan_hash VARCHAR(64),
    row_estimate INTEGER,
    actual_rows INTEGER
);
 
-- Log performance periodically
EXPLAIN (ANALYZE, FORMAT JSON)
SELECT ...;  -- Critical correlated query
 
-- Compare plan_hash over time to detect plan regressions
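The plan_hash column can be filled with any stable fingerprint of the plan. As a minimal sketch, the helper below hashes plan text with SHA-256, whose hex digest is exactly 64 characters and so fits VARCHAR(64). The function name `plan_fingerprint` is hypothetical, and SQLite's `EXPLAIN QUERY PLAN` stands in for PostgreSQL's `EXPLAIN (FORMAT JSON)`, whose cost figures would have to be stripped before hashing.

```python
import hashlib
import sqlite3


def plan_fingerprint(conn, sql):
    """Return a 64-character SHA-256 hex digest of the query plan text."""
    details = [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]
    return hashlib.sha256("\n".join(details).encode("utf-8")).hexdigest()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees ("
    "emp_id INTEGER PRIMARY KEY, name TEXT, dept_id INTEGER, salary REAL)"
)

query = """
    SELECT e.name FROM employees e
    WHERE e.salary > (
        SELECT AVG(e2.salary) FROM employees e2
        WHERE e2.dept_id = e.dept_id
    )
"""

baseline = plan_fingerprint(conn, query)
unchanged = plan_fingerprint(conn, query)

# A schema change that alters the chosen plan should alter the fingerprint.
conn.execute("CREATE INDEX idx_employees_dept_salary ON employees(dept_id, salary)")
after_index = plan_fingerprint(conn, query)

print(baseline == unchanged, baseline == after_index)
```

A changed fingerprint only says that the plan differs, not whether it got better or worse, so it works best as a trigger for a manual review alongside the logged execution times.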

Data Growth Is the Enemy

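Correlated subquery performance often degrades nonlinearly with data growth. Under the N × M model, when both the outer and inner tables grow 10×, the work grows roughly 100×: a query that takes 100 ms at 100K rows might take 10 seconds at 1M rows and more than 15 minutes at 10M rows. Plan for growth.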

Summary: Performance Considerations

We've comprehensively covered performance aspects of correlated subqueries. Here are the essential takeaways:

Key Takeaways

•Understand the model — Theoretical N×M execution is often avoided through optimizer transformations, indexing, and caching.
•Always check execution plans — EXPLAIN ANALYZE reveals actual behavior. Look for 'loops' count and SubPlan vs Join operators.
•Index correlation columns — The single most impactful optimization. Index the inner table's columns used in correlation predicates.
•Rewrite when necessary — Transform to JOINs, window functions, or LATERAL when optimizer doesn't decorrelate automatically.
•Avoid anti-patterns — Multiple similar subqueries, unindexed columns, complex joins inside subqueries, and functions on correlation columns.
•Know when to accept — Small outer sets, indexed lookups, successful decorrelation, and infrequent batch queries often don't need optimization.

Module Complete! You've now mastered correlated subqueries—from the fundamental concept of correlation, through EXISTS and NOT EXISTS operators, the comparison with non-correlated alternatives, and finally performance optimization. These skills enable you to write sophisticated, efficient SQL queries that leverage one of the most powerful features of the SQL language.

Correlated Subqueries Mastered

You now have comprehensive knowledge of correlated subquery performance—how to analyze it, optimize it, and know when optimization is unnecessary. Combined with the previous pages, you have complete command over correlated subqueries in SQL.
