Loading learning content...
Some SQL queries hide devastating performance characteristics in plain sight. They look reasonable, return correct results, and execute instantly on small datasets. But as data grows, they slow down not linearly—but quadratically.
The hidden culprit is often a correlated subquery: a subquery that references columns from the outer query and must be re-executed for every row the outer query examines.
Consider this innocent-looking query:
SELECT e.employee_id, e.name, e.salary,
(SELECT AVG(salary) FROM employees e2 WHERE e2.department_id = e.department_id)
FROM employees e;
What you expect: The database calculates department averages and returns them with employee data.
What actually happens: For each of the 100,000 employees, the database runs a separate query against the employees table to compute the department average. That's 100,000 subquery executions, each potentially scanning thousands of rows.
The complexity explosion:
By the end of this page, you will understand exactly how correlated subqueries execute, recognize patterns that cause performance problems, master rewrite techniques using JOINs and window functions, and know when (rarely) correlated subqueries are acceptable.
Before addressing performance issues, we must clearly distinguish between correlated and non-correlated subqueries, as they have fundamentally different execution characteristics.
Non-Correlated (Independent) Subquery:
A non-correlated subquery doesn't reference the outer query. It can be executed once, independently, and its result reused:
-- Non-correlated: subquery is independent
SELECT * FROM employees
WHERE salary > (SELECT AVG(salary) FROM employees);
-- ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-- No reference to outer query's employees table
Execution: The inner query runs once, returns a single value (e.g., 75000), and then the outer query becomes WHERE salary > 75000—a simple, efficient scan or index seek.
Correlated (Dependent) Subquery:
A correlated subquery references columns from the outer query, creating a dependency:
-- Correlated: subquery references outer query
SELECT * FROM employees e
WHERE salary > (SELECT AVG(salary) FROM employees e2
WHERE e2.department_id = e.department_id);
-- ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-- References e.department_id from outer query
Execution: For each row examined by the outer query, the inner query must run with that row's department_id. If the outer query examines 100,000 rows, the inner query runs 100,000 times.
Correlated subqueries are syntactically compact and logically intuitive—which makes them attractive to write. But this readability comes at a severe performance cost. The database cannot cache or reuse the subquery result; it must re-evaluate for every row.
Understanding the execution model of correlated subqueries reveals why they cause performance problems.
The Nested Loop Execution Model:
At a conceptual level, a correlated subquery executes like a nested loop in procedural code:
12345678910111213141516171819202122232425262728293031
-- SQL with correlated subquerySELECT e.employee_id, e.name, (SELECT AVG(o.amount) FROM orders o WHERE o.employee_id = e.employee_id) AS avg_orderFROM employees eWHERE e.status = 'active'; -- Conceptual execution (pseudocode)result = []FOR each employee e IN (SELECT * FROM employees WHERE status = 'active'): # For every single row in the outer result... subquery_result = EXECUTE( SELECT AVG(amount) FROM orders WHERE employee_id = e.employee_id ) # ...we run a complete query against the orders table result.append(e.employee_id, e.name, subquery_result) RETURN result # If we have:# - 10,000 active employees# - Orders table with 5,000,000 rows# # Execution:# - Outer loop: 10,000 iterations# - Inner query: Each may scan orders table (unless indexed)# - Best case (indexed): 10,000 index seeks# - Worst case (unindexed): 10,000 full table scans = 50 billion row readsExecution Plan Representation:
In actual execution plans, correlated subqueries appear as nested loop joins or specific operators:
| Database | Operator Name | Description |
|---|---|---|
| SQL Server | Nested Loops (with index spool) | Inner query re-executed; may cache via spool |
| PostgreSQL | SubPlan | Subquery executed per outer row |
| MySQL | DEPENDENT SUBQUERY | Explicitly marked in EXPLAIN |
| Oracle | NESTED LOOPS or FILTER | Subquery in nested execution context |
The Quadratic Growth Pattern:
The key insight is that complexity is multiplicative, not additive:
Total work ≈ (rows from outer query) × (work per subquery execution)
If the outer query returns 10,000 rows and each subquery execution examines 5,000 rows:
This 3,300x difference explains why queries that work fine in development (100 rows) become unusable in production (100,000 rows): 100 × 100 = 10,000, but 100,000 × 100,000 = 10,000,000,000.
Modern query optimizers can sometimes 'decorrelate' subqueries, automatically rewriting them as joins. However, this optimization isn't always possible, and different databases have varying capabilities. Don't rely on the optimizer—write efficient queries explicitly.
Correlated subqueries appear in several common patterns. Recognizing these patterns is the first step toward optimization.
Pattern 1: Scalar Subquery in SELECT (Per-Row Computation)
12345678910111213141516171819202122232425
-- PROBLEM: Scalar subquery for each rowSELECT c.customer_id, c.name, (SELECT COUNT(*) FROM orders o WHERE o.customer_id = c.customer_id) AS order_count, (SELECT MAX(order_date) FROM orders o WHERE o.customer_id = c.customer_id) AS last_order, (SELECT SUM(total) FROM orders o WHERE o.customer_id = c.customer_id) AS total_spentFROM customers c; -- For 50,000 customers, this runs 150,000 subqueries (3 per row)!-- Each subquery may scan or seek in the orders table -- SOLUTION: Single JOIN with aggregationSELECT c.customer_id, c.name, COUNT(o.order_id) AS order_count, MAX(o.order_date) AS last_order, SUM(o.total) AS total_spentFROM customers cLEFT JOIN orders o ON c.customer_id = o.customer_idGROUP BY c.customer_id, c.name; -- One scan of customers, one scan of orders, one join operation-- Complexity: O(n + m) instead of O(n × m)Pattern 2: Existence Check in WHERE (EXISTS/NOT EXISTS)
12345678910111213141516171819202122232425262728293031323334353637383940414243
-- PROBLEM: EXISTS with correlated subquerySELECT c.customer_id, c.nameFROM customers cWHERE EXISTS ( SELECT 1 FROM orders o WHERE o.customer_id = c.customer_id AND o.order_date >= '2024-01-01'); -- For each customer, check if matching orders exist-- With proper indexing on orders(customer_id, order_date), -- EXISTS can short-circuit (stop at first match)-- This is often acceptable, but alternatives may be faster -- ALTERNATIVE 1: Semi-join (if database supports; often equivalent)SELECT DISTINCT c.customer_id, c.nameFROM customers cJOIN orders o ON c.customer_id = o.customer_idWHERE o.order_date >= '2024-01-01'; -- ALTERNATIVE 2: IN with non-correlated subquerySELECT c.customer_id, c.nameFROM customers cWHERE c.customer_id IN ( SELECT DISTINCT o.customer_id FROM orders o WHERE o.order_date >= '2024-01-01');-- Inner query runs once, returns set; outer query filters against set -- NOT EXISTS is similar but for anti-joins:-- PROBLEM: Find customers with no ordersSELECT c.customer_id, c.nameFROM customers cWHERE NOT EXISTS ( SELECT 1 FROM orders o WHERE o.customer_id = c.customer_id); -- SOLUTION: LEFT JOIN with NULL checkSELECT c.customer_id, c.nameFROM customers cLEFT JOIN orders o ON c.customer_id = o.customer_idWHERE o.order_id IS NULL;Pattern 3: Comparison with Aggregate (Per-Group Comparison)
123456789101112131415161718192021222324252627282930313233343536373839404142
-- PROBLEM: Find employees earning more than their department averageSELECT e.employee_id, e.name, e.salary, e.department_idFROM employees eWHERE e.salary > ( SELECT AVG(e2.salary) FROM employees e2 WHERE e2.department_id = e.department_id); -- For each of 100,000 employees, compute department average-- If 100 departments with 1,000 employees each:-- 100,000 × 1,000 = 100,000,000 row examinations! -- SOLUTION: Window function (calculate average once per partition)SELECT employee_id, name, salary, department_idFROM ( SELECT e.employee_id, e.name, e.salary, e.department_id, AVG(e.salary) OVER (PARTITION BY e.department_id) AS dept_avg FROM employees e) subWHERE salary > dept_avg; -- Single pass through employees table-- Window function computes aggregate per department partition-- Complexity: O(n) instead of O(n × m) -- ALTERNATIVE: JOIN with pre-computed aggregates (CTE)WITH dept_averages AS ( SELECT department_id, AVG(salary) AS avg_salary FROM employees GROUP BY department_id)SELECT e.employee_id, e.name, e.salary, e.department_idFROM employees eJOIN dept_averages da ON e.department_id = da.department_idWHERE e.salary > da.avg_salary; -- Two passes: one for aggregate, one for join/filterPattern 4: Row Limiting with Correlated Logic (Top-N Per Group)
12345678910111213141516171819202122232425262728293031323334353637383940414243444546
-- PROBLEM: Find the most recent order for each customerSELECT o.*FROM orders oWHERE o.order_date = ( SELECT MAX(o2.order_date) FROM orders o2 WHERE o2.customer_id = o.customer_id); -- For each order, re-scan orders to find maximum date for that customer-- Extremely expensive for large tables -- SOLUTION 1: Window function with ROW_NUMBER()SELECT order_id, customer_id, order_date, totalFROM ( SELECT o.*, ROW_NUMBER() OVER ( PARTITION BY customer_id ORDER BY order_date DESC ) AS rn FROM orders o) rankedWHERE rn = 1; -- Single pass with window function ranking-- Much more efficient -- SOLUTION 2: DISTINCT ON (PostgreSQL-specific)SELECT DISTINCT ON (customer_id) *FROM ordersORDER BY customer_id, order_date DESC; -- SOLUTION 3: LATERAL JOIN (PostgreSQL, newer SQL Server)SELECT c.customer_id, c.name, lo.order_id, lo.order_dateFROM customers cCROSS JOIN LATERAL ( SELECT o.order_id, o.order_date FROM orders o WHERE o.customer_id = c.customer_id ORDER BY o.order_date DESC LIMIT 1) lo; -- LATERAL is still correlated but with explicit row limiting-- Can be efficient with proper indexingLook for these SQL patterns: (1) Subquery references a table alias from the outer query, (2) Subquery in SELECT clause that isn't a constant, (3) EXISTS/NOT EXISTS with inner/outer table correlation, (4) Scalar comparison against an aggregate that varies per row.
The most common optimization for correlated subqueries is rewriting with JOINs. This transforms the per-row execution model into a set-based operation that the optimizer can execute more efficiently.
The General Transformation Pattern:
Correlated Subquery: JOIN Equivalent:
───────────────────── ────────────────
SELECT outer.*, (subquery) → SELECT outer.*, agg.*
FROM outer FROM outer
WHERE condition with subquery JOIN aggregated_subquery agg
ON outer.key = agg.key
WHERE condition
Step-by-Step Transformation:
12345678910111213141516171819202122232425262728293031323334353637383940414243
-- ORIGINAL: Correlated scalar subquerySELECT p.product_id, p.name, p.price, (SELECT COUNT(*) FROM order_items oi WHERE oi.product_id = p.product_id) AS times_ordered, (SELECT SUM(quantity) FROM order_items oi WHERE oi.product_id = p.product_id) AS total_quantityFROM products p; -- STEP 1: Identify the subquery aggregations-- - COUNT(*) grouped by product_id-- - SUM(quantity) grouped by product_id -- STEP 2: Create the aggregated subquery as a derived table/CTEWITH product_stats AS ( SELECT oi.product_id, COUNT(*) AS times_ordered, SUM(quantity) AS total_quantity FROM order_items oi GROUP BY oi.product_id) -- STEP 3: JOIN to the outer querySELECT p.product_id, p.name, p.price, COALESCE(ps.times_ordered, 0) AS times_ordered, COALESCE(ps.total_quantity, 0) AS total_quantityFROM products pLEFT JOIN product_stats ps ON p.product_id = ps.product_id; -- LEFT JOIN ensures products with no orders are included (with 0 values)-- COALESCE handles NULL from unmatched LEFT JOIN -- PERFORMANCE COMPARISON:-- Correlated: 10,000 products × 500,000 order_items scans (per product)-- JOIN: 1 scan of products + 1 scan of order_items + join operationEXISTS to JOIN Transformation:
1234567891011121314151617181920212223242526272829303132333435363738394041424344
-- ORIGINAL: EXISTS correlated subquerySELECT c.*FROM customers cWHERE EXISTS ( SELECT 1 FROM orders o WHERE o.customer_id = c.customer_id AND o.status = 'completed' AND o.total > 1000); -- REWRITE: Semi-join with DISTINCTSELECT DISTINCT c.*FROM customers cINNER JOIN orders o ON c.customer_id = o.customer_idWHERE o.status = 'completed' AND o.total > 1000; -- OR: Using IN with derived setSELECT c.*FROM customers cWHERE c.customer_id IN ( SELECT DISTINCT o.customer_id FROM orders o WHERE o.status = 'completed' AND o.total > 1000); -- The IN version is non-correlated: inner query runs once -- NOT EXISTS to LEFT JOIN transformation:-- ORIGINAL:SELECT c.*FROM customers cWHERE NOT EXISTS ( SELECT 1 FROM orders o WHERE o.customer_id = c.customer_id); -- REWRITE: Anti-join patternSELECT c.*FROM customers cLEFT JOIN orders o ON c.customer_id = o.customer_idWHERE o.order_id IS NULL; -- This is often the most efficient anti-join in many databasesWhen converting correlated subqueries to JOINs, be careful about cardinality. A scalar subquery returns exactly one value per outer row. A JOIN may return multiple rows if the relationship isn't 1:1. Use DISTINCT or ensure your join keys are unique on the joined side, or use aggregation to collapse to one row.
Window functions are often the most elegant and efficient solution for correlated subquery patterns, particularly those involving per-group comparisons or rankings.
Why Window Functions Excel:
123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113
-- ========================================-- PATTERN 1: Compare to group aggregate-- ======================================== -- PROBLEM (correlated):SELECT e.*FROM employees eWHERE e.salary > ( SELECT AVG(salary) FROM employees e2 WHERE e2.department_id = e.department_id); -- SOLUTION (window function):SELECT *FROM ( SELECT e.*, AVG(salary) OVER (PARTITION BY department_id) AS dept_avg FROM employees e) subWHERE salary > dept_avg; -- ========================================-- PATTERN 2: Find max/min row per group-- ======================================== -- PROBLEM (correlated):SELECT o.*FROM orders oWHERE o.order_date = ( SELECT MAX(order_date) FROM orders o2 WHERE o2.customer_id = o.customer_id); -- SOLUTION (window function with ROW_NUMBER):SELECT *FROM ( SELECT o.*, ROW_NUMBER() OVER ( PARTITION BY customer_id ORDER BY order_date DESC ) AS rn FROM orders o) rankedWHERE rn = 1; -- Use RANK() or DENSE_RANK() if ties should return multiple rows -- ========================================-- PATTERN 3: Running comparisons-- ======================================== -- PROBLEM: Find orders where total exceeds cumulative average-- SOLUTION (window function):SELECT *FROM ( SELECT o.*, AVG(total) OVER ( PARTITION BY customer_id ORDER BY order_date ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING ) AS avg_previous_orders FROM orders o) subWHERE total > avg_previous_orders; -- ========================================-- PATTERN 4: Percentage of group total-- ======================================== -- PROBLEM (correlated):SELECT d.department_id, e.employee_id, e.salary, e.salary * 100.0 / ( SELECT SUM(e2.salary) FROM employees e2 WHERE e2.department_id = d.department_id ) AS pct_of_deptFROM departments dJOIN employees e ON d.department_id = e.department_id; -- SOLUTION (window function):SELECT department_id, employee_id, salary, salary * 100.0 / SUM(salary) OVER (PARTITION BY department_id) AS pct_of_deptFROM employees; -- ========================================-- PATTERN 5: Top N per group-- ======================================== -- Find top 3 highest-paid employees per department -- SOLUTION (window function):SELECT *FROM ( SELECT e.*, RANK() OVER ( PARTITION BY department_id ORDER BY salary DESC ) AS salary_rank FROM employees e) rankedWHERE salary_rank <= 3;| Correlated Pattern | Window Function Solution | Key Insight |
|---|---|---|
| Compare to group AVG/SUM/COUNT | AGG() OVER (PARTITION BY group_col) | Aggregate computed once per partition |
| Find row with MAX/MIN per group | ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...) = 1 | Ranking finds extreme values |
| Compare to row's peers | LAG()/LEAD() or frame clauses | Access adjacent rows without self-join |
| Running/cumulative calculations | SUM() OVER (ORDER BY ... ROWS ...) | Frame specification controls accumulation |
| Percentage of total | value / SUM() OVER (PARTITION BY ...) | Total computed per partition |
Window functions process the entire result set in one pass while computing per-partition aggregates. Multiple window functions with the same PARTITION BY and ORDER BY can often share the same sort operation, making them extremely efficient for complex analytics.
Identifying problematic correlated subqueries requires both code review and execution plan analysis.
Code-Level Detection:
Look for these patterns in SQL:
= (SELECT ...) or IN (SELECT ...) where the inner SELECT references outer tablesExecution Plan Detection:
12345678910111213141516171819202122232425262728293031323334
-- Enable execution planSET SHOWPLAN_TEXT ON;GOSELECT e.*, (SELECT AVG(salary) FROM employees e2 WHERE e2.dept_id = e.dept_id)FROM employees e;GOSET SHOWPLAN_TEXT OFF; -- Look for in execution plan:-- 1. Nested Loops operator with inner side as subquery-- 2. High "Actual Rows" × "Executions" ratio-- 3. Spools (Table Spool, Index Spool) indicating repeated access -- In graphical plan (SSMS):-- - Thick arrows (indicating high row counts)-- - Nested Loops with subquery on inner side-- - "Number of Executions" equal to outer row count -- Query for plans with high-cost subqueriesSELECT qs.execution_count, qs.total_elapsed_time / 1000000.0 AS total_sec, qs.total_elapsed_time / qs.execution_count / 1000.0 AS avg_ms, SUBSTRING(qt.text, qs.statement_start_offset/2 + 1, (CASE qs.statement_end_offset WHEN -1 THEN DATALENGTH(qt.text) ELSE qs.statement_end_offset END - qs.statement_start_offset)/2 + 1) AS query_textFROM sys.dm_exec_query_stats qsCROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) qtWHERE qt.text LIKE '%SELECT%SELECT%' -- Nested SELECT (rough pattern)AND qs.total_elapsed_time / qs.execution_count > 1000000 -- >1 sec avgORDER BY avg_ms DESC;In PostgreSQL's EXPLAIN ANALYZE output, the 'loops=N' value is critical. It shows how many times an operation was executed. For a correlated subquery, N often equals the number of rows in the outer query—a clear sign of per-row execution.
While this page emphasizes the performance dangers of correlated subqueries, they're not universally bad. In specific situations, they may be acceptable or even optimal.
Scenarios Where Correlated Subqueries May Be Acceptable:
123456789101112131415161718192021222324252627282930313233343536373839
-- Example 1: EXISTS with good indexing-- Finding active users who have admin roles (rare combination)SELECT u.*FROM users uWHERE u.status = 'active' -- Very selective: 100 active usersAND EXISTS ( SELECT 1 FROM user_roles ur WHERE ur.user_id = u.user_id AND ur.role_name = 'admin' -- Index on user_roles(user_id, role_name) makes this O(1) per user);-- 100 users × O(1) index probe = 100 index probes: acceptable -- Example 2: LATERAL for top-N per group (PostgreSQL, SQL Server as CROSS APPLY)SELECT c.customer_id, c.name, recent_orders.*FROM customers cCROSS JOIN LATERAL ( SELECT o.order_id, o.order_date, o.total FROM orders o WHERE o.customer_id = c.customer_id ORDER BY o.order_date DESC LIMIT 3) recent_orders;-- LATERAL is correlated but allows LIMIT within the correlation-- With index on orders(customer_id, order_date DESC), this is efficient -- Example 3: Optimizer-decorrelated query-- Some databases automatically rewrite this:SELECT e.*FROM employees eWHERE e.salary > ( SELECT AVG(salary) FROM employees WHERE dept_id = e.dept_id); -- Check the execution plan:-- If it shows Hash Join or Merge Join instead of Nested Loops,-- the optimizer decorrelated the query. -- In that case, the correlated syntax has no performance penalty.Never assume a correlated subquery will be optimized away. Always check the execution plan. Look for the number of loops/executions, the join type (Nested Loops suggests correlation; Hash/Merge Join suggests decorrelation), and the total work performed.
Correlated subqueries are a common source of catastrophic query performance problems. Their innocent syntax hides quadratic execution complexity that only manifests at production scale.
You now understand why correlated subqueries cause performance problems and how to rewrite them using JOINs and window functions. Next, we'll examine the notorious N+1 query problem—a related pattern that plagues object-relational mapping and application code.