Loading learning content...
Subqueries are queries nested within other queries, enabling solutions that would be impossible or extremely awkward with flat SQL alone. They allow you to compute values on-the-fly, filter based on aggregated results, and express complex logical conditions that reference other result sets.
In technical interviews, subqueries are often the key to solving problems that initially seem intractable. More importantly, understanding when to use subqueries versus joins—and recognizing the patterns where each excels—demonstrates the sophisticated SQL thinking that distinguishes exceptional candidates.
By the end of this page, you will master scalar subqueries for single-value computations, table subqueries for derived tables and inline views, correlated subqueries for row-by-row processing, Common Table Expressions (CTEs) for readable modular queries, and recursive CTEs for hierarchical and graph data.
A subquery is a SELECT statement embedded within another SQL statement. Understanding where subqueries can appear and what they return is fundamental to using them effectively.
| Subquery Type | Returns | Usage Location | Correlation |
|---|---|---|---|
| Scalar Subquery | Single value (1 row, 1 column) | SELECT, WHERE, HAVING | Optional |
| Row Subquery | Single row (1 row, multiple columns) | WHERE (with row constructor) | Optional |
| Table Subquery | Multiple rows and columns | FROM, JOIN | Not allowed |
| Correlated Subquery | Any of above, references outer query | SELECT, WHERE, HAVING | Required |
Subquery Execution Conceptual Model:
To understand subquery behavior, it helps to visualize the execution model:
12345678910111213141516171819202122232425262728293031323334353637383940414243444546
-- Scalar subquery in SELECT: Single value per row contextSELECT product_name, unit_price, (SELECT AVG(unit_price) FROM products) AS avg_price, unit_price - (SELECT AVG(unit_price) FROM products) AS diff_from_avgFROM products; -- Scalar subquery in WHERE: Filter based on computed valueSELECT product_name, unit_priceFROM productsWHERE unit_price > (SELECT AVG(unit_price) FROM products); -- Table subquery in FROM: Derived tableSELECT category, avg_price, product_countFROM ( SELECT category_id AS category, AVG(unit_price) AS avg_price, COUNT(*) AS product_count FROM products GROUP BY category_id) AS category_statsWHERE product_count >= 5; -- Subquery with IN: Set membership testSELECT customer_name, emailFROM customersWHERE customer_id IN ( SELECT DISTINCT customer_id FROM orders WHERE order_date >= CURRENT_DATE - INTERVAL '30 days'); -- Subquery with EXISTS: Existence testSELECT customer_name, emailFROM customers cWHERE EXISTS ( SELECT 1 FROM orders o WHERE o.customer_id = c.customer_id AND o.total > 1000);A scalar subquery must return exactly one value (one row, one column). If it returns no rows, the result is NULL. If it returns multiple rows, an error occurs. This is critical for WHERE and SELECT usage.
A correlated subquery references columns from the outer query, creating a dependency that (conceptually) causes the subquery to execute once for each row of the outer query. This enables powerful patterns but requires careful performance consideration.
Identifying Correlated Subqueries:
A subquery is correlated when it references a column from the outer query that is not defined within the subquery itself.
123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263
-- Correlated in WHERE: Find products priced above their category averageSELECT p.product_name, p.category_id, p.unit_priceFROM products pWHERE p.unit_price > ( SELECT AVG(p2.unit_price) FROM products p2 WHERE p2.category_id = p.category_id -- Correlation: references outer p); -- Correlated in SELECT: Running countSELECT o.order_id, o.order_date, o.customer_id, ( SELECT COUNT(*) FROM orders o2 WHERE o2.customer_id = o.customer_id -- Correlation AND o2.order_date <= o.order_date ) AS customer_order_numberFROM orders oORDER BY o.customer_id, o.order_date; -- EXISTS with correlation: Customers with high-value ordersSELECT c.customer_id, c.nameFROM customers cWHERE EXISTS ( SELECT 1 FROM orders o WHERE o.customer_id = c.customer_id -- Correlation AND o.total > 1000); -- NOT EXISTS with correlation: Customers with no orders this yearSELECT c.customer_id, c.name, c.emailFROM customers cWHERE NOT EXISTS ( SELECT 1 FROM orders o WHERE o.customer_id = c.customer_id -- Correlation AND EXTRACT(YEAR FROM o.order_date) = EXTRACT(YEAR FROM CURRENT_DATE)); -- Correlated subquery for top-N per group (without window functions)SELECT p.product_id, p.product_name, p.category_id, p.unit_priceFROM products pWHERE ( SELECT COUNT(*) FROM products p2 WHERE p2.category_id = p.category_id -- Same category AND p2.unit_price > p.unit_price -- Higher price) < 3; -- Fewer than 3 products have higher price = top 3 -- Correlated UPDATE: Update with computed value from related tableUPDATE customers cSET total_orders = ( SELECT COUNT(*) FROM orders o WHERE o.customer_id = c.customer_id);Correlated subqueries can be inefficient on large tables as they conceptually execute once per outer row. Modern optimizers often transform them into joins, but when performance matters, consider rewriting as a JOIN or using window functions. Always check EXPLAIN ANALYZE.
Common Interview Pattern: All-Match Condition
Finding rows that match ALL members of a set is a classic interview problem that correlated subqueries solve elegantly:
12345678910111213141516171819202122232425262728293031323334353637383940
-- Find students who have taken ALL required courses-- Required courses: MATH101, COMP101, PHYS101 -- Method 1: Double NOT EXISTS (relational division)SELECT s.student_id, s.nameFROM students sWHERE NOT EXISTS ( -- Find any required course this student hasn't taken SELECT 1 FROM (VALUES ('MATH101'), ('COMP101'), ('PHYS101')) AS required(course_id) WHERE NOT EXISTS ( SELECT 1 FROM enrollments e WHERE e.student_id = s.student_id AND e.course_id = required.course_id )); -- Method 2: COUNT comparisonSELECT s.student_id, s.nameFROM students sJOIN enrollments e ON s.student_id = e.student_idWHERE e.course_id IN ('MATH101', 'COMP101', 'PHYS101')GROUP BY s.student_id, s.nameHAVING COUNT(DISTINCT e.course_id) = 3; -- Must have all 3 -- Find suppliers who supply ALL products in a categorySELECT s.supplier_id, s.supplier_nameFROM suppliers sWHERE NOT EXISTS ( SELECT p.product_id FROM products p WHERE p.category_id = 'CAT-ELECTRONICS' AND NOT EXISTS ( SELECT 1 FROM supplier_products sp WHERE sp.supplier_id = s.supplier_id AND sp.product_id = p.product_id ));Common Table Expressions (CTEs), introduced with the WITH clause, define named temporary result sets that exist only within the scope of a single statement. CTEs dramatically improve query readability and enable step-by-step query construction.
CTE Advantages:
12345678910111213141516171819202122232425262728293031323334353637383940414243444546474849505152535455565758596061626364656667686970717273747576777879808182838485
-- Basic CTE: Named temporary result setWITH high_value_customers AS ( SELECT customer_id, name, SUM(order_total) AS lifetime_value FROM customers c JOIN orders o ON c.customer_id = o.customer_id GROUP BY c.customer_id, c.name HAVING SUM(order_total) > 10000)SELECT * FROM high_value_customersORDER BY lifetime_value DESC; -- Multiple CTEs: Step-by-step query constructionWITH customer_orders AS ( SELECT customer_id, COUNT(*) AS order_count, SUM(total) AS total_spending, AVG(total) AS avg_order FROM orders GROUP BY customer_id),customer_segments AS ( SELECT customer_id, order_count, total_spending, avg_order, CASE WHEN total_spending >= 10000 THEN 'VIP' WHEN total_spending >= 5000 THEN 'Gold' WHEN total_spending >= 1000 THEN 'Silver' ELSE 'Bronze' END AS segment FROM customer_orders)SELECT c.name, c.email, cs.segment, cs.order_count, cs.total_spending, ROUND(cs.avg_order, 2) AS avg_orderFROM customers cJOIN customer_segments cs ON c.customer_id = cs.customer_idORDER BY cs.total_spending DESC; -- CTE referenced multiple timesWITH monthly_sales AS ( SELECT DATE_TRUNC('month', order_date) AS month, SUM(total) AS revenue FROM orders GROUP BY DATE_TRUNC('month', order_date))SELECT curr.month, curr.revenue AS current_revenue, prev.revenue AS previous_revenue, curr.revenue - COALESCE(prev.revenue, 0) AS change, ROUND( 100.0 * (curr.revenue - COALESCE(prev.revenue, 0)) / NULLIF(prev.revenue, 0), 2 ) AS pct_changeFROM monthly_sales currLEFT JOIN monthly_sales prev ON curr.month = prev.month + INTERVAL '1 month'ORDER BY curr.month; -- CTE with INSERT (PostgreSQL, SQL Server)WITH new_orders AS ( SELECT order_id, customer_id, total FROM staging_orders WHERE validated = true)INSERT INTO orders (order_id, customer_id, total, created_at)SELECT order_id, customer_id, total, CURRENT_TIMESTAMPFROM new_orders;Different databases handle CTE optimization differently. PostgreSQL may materialize CTEs (compute once, store result) while others inline them like subqueries. For critical performance, check your database's behavior. PostgreSQL 12+ allows 'AS MATERIALIZED' or 'AS NOT MATERIALIZED' hints.
Recursive CTEs enable queries that reference themselves, essential for traversing hierarchical data (org charts, bill of materials) and graph structures (social networks, dependencies).
Recursive CTE Structure:
WITH RECURSIVE cte_name AS (
-- Anchor member: Starting point (non-recursive)
SELECT ...
UNION [ALL]
-- Recursive member: References CTE itself
SELECT ... FROM cte_name WHERE termination_condition
)
SELECT * FROM cte_name;
123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108
-- Classic: Employee hierarchy (org chart)WITH RECURSIVE org_hierarchy AS ( -- Anchor: Start with the CEO (no manager) SELECT employee_id, name, manager_id, 1 AS level, name::VARCHAR(1000) AS path FROM employees WHERE manager_id IS NULL UNION ALL -- Recursive: Find direct reports of current level SELECT e.employee_id, e.name, e.manager_id, oh.level + 1, (oh.path || ' > ' || e.name)::VARCHAR(1000) FROM employees e JOIN org_hierarchy oh ON e.manager_id = oh.employee_id)SELECT employee_id, REPEAT(' ', level - 1) || name AS org_chart, level, pathFROM org_hierarchyORDER BY path; -- Bill of Materials: Find all components of a productWITH RECURSIVE bom AS ( -- Anchor: Start with the top-level product SELECT component_id, component_name, 1 AS quantity, 1 AS level FROM components WHERE component_id = 'PROD-001' UNION ALL -- Recursive: Find sub-components SELECT c.component_id, c.component_name, b.quantity * pc.quantity, -- Accumulated quantity b.level + 1 FROM bom b JOIN product_components pc ON b.component_id = pc.parent_id JOIN components c ON pc.component_id = c.component_id WHERE b.level < 10 -- Prevent infinite recursion)SELECT component_id, component_name, SUM(quantity) AS total_needed, MIN(level) AS first_levelFROM bomGROUP BY component_id, component_nameORDER BY first_level, component_name; -- Graph traversal: Find all connections within N degreesWITH RECURSIVE connections AS ( -- Anchor: Direct connections of user 1001 SELECT friend_id AS user_id, 1 AS degree, ARRAY[1001, friend_id] AS path FROM friendships WHERE user_id = 1001 UNION -- Recursive: Friends of friends SELECT f.friend_id, c.degree + 1, c.path || f.friend_id FROM connections c JOIN friendships f ON c.user_id = f.user_id WHERE c.degree < 3 -- Up to 3 degrees AND NOT f.friend_id = ANY(c.path) -- Prevent cycles)SELECT DISTINCT user_id, MIN(degree) AS closest_degreeFROM connectionsGROUP BY user_idORDER BY closest_degree, user_id; -- Generate series (useful for filling gaps)WITH RECURSIVE date_series AS ( SELECT DATE '2024-01-01' AS date UNION ALL SELECT date + INTERVAL '1 day' FROM date_series WHERE date < DATE '2024-12-31')SELECT ds.date, COALESCE(o.order_count, 0) AS ordersFROM date_series dsLEFT JOIN ( SELECT order_date::DATE, COUNT(*) AS order_count FROM orders GROUP BY order_date::DATE) o ON ds.date = o.order_dateORDER BY ds.date;Recursive CTEs can run infinitely if not properly bounded. Always include: (1) A termination condition in the recursive member (WHERE level < N, or path checks), (2) A maximum iteration limit if your database supports it. PostgreSQL's default limit is 1000 iterations.
Many problems can be solved with either subqueries or joins. Understanding when each approach is preferable demonstrates sophisticated SQL thinking.
| Scenario | Prefer Subquery | Prefer Join |
|---|---|---|
| Existence/absence check | EXISTS/NOT EXISTS | — |
| Single aggregate comparison | Scalar subquery | — |
| Multiple columns from related table | — | JOIN |
| Need data from both tables | — | JOIN |
| Filtering against a computed set | IN with subquery | CTE + JOIN |
| Row-by-row computation | Correlated subquery or | Window function |
| Complex multi-step logic | CTEs | CTEs |
| Performance-critical | Depends—test both! | Depends—test both! |
1234567891011121314151617181920212223242526272829303132333435363738394041424344454647484950515253545556575859606162636465666768697071727374757677787980818283848586
-- Scenario: Find customers with orders > average order value -- Approach 1: Subquery (clear intent)SELECT customer_id, nameFROM customersWHERE customer_id IN ( SELECT customer_id FROM orders WHERE total > (SELECT AVG(total) FROM orders)); -- Approach 2: JOIN (same result, different style)SELECT DISTINCT c.customer_id, c.nameFROM customers cJOIN orders o ON c.customer_id = o.customer_idCROSS JOIN (SELECT AVG(total) AS avg_total FROM orders) avgWHERE o.total > avg.avg_total; -- Scenario: Products with their sales, including products with no sales -- Subquery approach (awkward for this case)SELECT p.product_id, p.product_name, ( SELECT COALESCE(SUM(oi.quantity), 0) FROM order_items oi WHERE oi.product_id = p.product_id ) AS total_soldFROM products p; -- JOIN approach (more natural)SELECT p.product_id, p.product_name, COALESCE(SUM(oi.quantity), 0) AS total_soldFROM products pLEFT JOIN order_items oi ON p.product_id = oi.product_idGROUP BY p.product_id, p.product_name; -- Scenario: Find the most recent order per customer -- Correlated subquery (classic approach)SELECT o1.*FROM orders o1WHERE o1.order_date = ( SELECT MAX(o2.order_date) FROM orders o2 WHERE o2.customer_id = o1.customer_id); -- CTE with window function (modern approach)WITH ranked_orders AS ( SELECT *, ROW_NUMBER() OVER ( PARTITION BY customer_id ORDER BY order_date DESC ) AS rn FROM orders)SELECT * FROM ranked_orders WHERE rn = 1; -- Scenario: Existence check (EXISTS is typically best) -- EXISTS (preferred - can short-circuit)SELECT c.customer_id, c.nameFROM customers cWHERE EXISTS ( SELECT 1 FROM orders o WHERE o.customer_id = c.customer_id AND o.order_date >= CURRENT_DATE - INTERVAL '30 days'); -- IN (similar, optimizer may transform to EXISTS)SELECT customer_id, nameFROM customersWHERE customer_id IN ( SELECT customer_id FROM orders WHERE order_date >= CURRENT_DATE - INTERVAL '30 days'); -- JOIN with DISTINCT (potentially less efficient)SELECT DISTINCT c.customer_id, c.nameFROM customers cJOIN orders o ON c.customer_id = o.customer_idWHERE o.order_date >= CURRENT_DATE - INTERVAL '30 days';In interviews, start with the approach that most clearly expresses your intent. Then, if asked about optimization, discuss alternatives. 'I wrote it with EXISTS for clarity because we only need to check existence, but I could also use an anti-join with LEFT JOIN IS NULL if performance testing suggested it was faster.'
Beyond basic usage, several advanced subquery patterns appear in complex queries and interviews. Mastering these elevates your SQL capabilities significantly.
Derived Tables: Pre-computed Results in FROM
Derived tables (subqueries in FROM clause) create inline virtual tables that can be joined and filtered like regular tables:
12345678910111213141516171819202122232425262728293031323334353637
-- Aggregate before joining to prevent row multiplicationSELECT c.customer_id, c.name, order_stats.total_orders, order_stats.total_spent, item_stats.total_itemsFROM customers cJOIN ( SELECT customer_id, COUNT(*) AS total_orders, SUM(total) AS total_spent FROM orders GROUP BY customer_id) AS order_stats ON c.customer_id = order_stats.customer_idJOIN ( SELECT o.customer_id, COUNT(oi.item_id) AS total_items FROM orders o JOIN order_items oi ON o.order_id = oi.order_id GROUP BY o.customer_id) AS item_stats ON c.customer_id = item_stats.customer_id; -- Filtering on aggregated dataSELECT *FROM ( SELECT category_id, COUNT(*) AS product_count, AVG(unit_price) AS avg_price FROM products GROUP BY category_id) AS category_statsWHERE product_count >= 10 AND avg_price > 50;Let's work through challenging interview problems that showcase subquery mastery:
123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109
/* Problem 1: Find departments where ALL employees earn more than the company average salary */ WITH company_avg AS ( SELECT AVG(salary) AS avg_salary FROM employees)SELECT d.department_id, d.department_nameFROM departments dWHERE NOT EXISTS ( -- Find any employee in this dept below company average SELECT 1 FROM employees e CROSS JOIN company_avg ca WHERE e.department_id = d.department_id AND e.salary <= ca.avg_salary); /* Problem 2: Find the earliest order for each customer, but only for customers who have placed at least 3 orders */ WITH customer_order_counts AS ( SELECT customer_id, COUNT(*) AS order_count FROM orders GROUP BY customer_id HAVING COUNT(*) >= 3),first_orders AS ( SELECT o.* FROM orders o WHERE o.order_date = ( SELECT MIN(o2.order_date) FROM orders o2 WHERE o2.customer_id = o.customer_id ))SELECT fo.*FROM first_orders foWHERE fo.customer_id IN (SELECT customer_id FROM customer_order_counts); /* Problem 3: For each product, calculate its revenue rank within its category and overall, without using window functions */ SELECT p.product_id, p.product_name, p.category_id, ps.product_revenue, ( SELECT COUNT(DISTINCT ps2.product_revenue) + 1 FROM ( SELECT product_id, SUM(quantity * unit_price) AS product_revenue FROM order_items GROUP BY product_id ) ps2 JOIN products p2 ON ps2.product_id = p2.product_id WHERE p2.category_id = p.category_id AND ps2.product_revenue > ps.product_revenue ) AS category_rank, ( SELECT COUNT(DISTINCT ps2.product_revenue) + 1 FROM ( SELECT product_id, SUM(quantity * unit_price) AS product_revenue FROM order_items GROUP BY product_id ) ps2 WHERE ps2.product_revenue > ps.product_revenue ) AS overall_rankFROM products pJOIN ( SELECT product_id, SUM(quantity * unit_price) AS product_revenue FROM order_items GROUP BY product_id) ps ON p.product_id = ps.product_idORDER BY category_id, category_rank; /* Problem 4: Find pairs of products frequently bought together (appearing in the same order at least 5 times) */ SELECT p1.product_name AS product_1, p2.product_name AS product_2, pair_countFROM ( SELECT oi1.product_id AS pid1, oi2.product_id AS pid2, COUNT(DISTINCT oi1.order_id) AS pair_count FROM order_items oi1 JOIN order_items oi2 ON oi1.order_id = oi2.order_id WHERE oi1.product_id < oi2.product_id -- Prevent duplicates and self-pairs GROUP BY oi1.product_id, oi2.product_id HAVING COUNT(DISTINCT oi1.order_id) >= 5) pairsJOIN products p1 ON pairs.pid1 = p1.product_idJOIN products p2 ON pairs.pid2 = p2.product_idORDER BY pair_count DESC, product_1, product_2; /* Problem 5: Find the median order value (without PERCENTILE functions) */ WITH ordered_values AS ( SELECT total, ROW_NUMBER() OVER (ORDER BY total) AS row_asc, ROW_NUMBER() OVER (ORDER BY total DESC) AS row_desc FROM orders)SELECT AVG(total) AS median_valueFROM ordered_valuesWHERE ABS(row_asc - row_desc) <= 1; -- Middle value(s)For complex subquery problems: (1) Identify what the final output should look like, (2) Work backwards—what intermediate results do you need?, (3) Consider if each piece is best expressed as a CTE, derived table, or inline subquery, (4) Build incrementally, testing each piece.
You've now developed comprehensive knowledge of SQL subqueries—from basic scalar subqueries to advanced recursive CTEs and complex interview patterns.
Key Takeaways:
What's Next:
With subquery mastery in place, we'll explore Window Functions in the next page—the powerful analytical capabilities that enable ranking, running totals, moving averages, and row comparisons without self-joins or correlated subqueries.
You now command the full power of SQL subqueries. From simple scalar computations to recursive hierarchical queries, these techniques enable elegant solutions to problems that would otherwise require complex procedural code or multiple queries.