Loading content...
SQL is a declarative language—you describe what you want, not how to get it. But the way you phrase your request profoundly affects the optimizer's ability to find an efficient execution path.
Two semantically equivalent queries can have radically different performance: one might take 10 milliseconds while the other takes 10 minutes. The difference lies in how the query is structured—which constructs you use, how you express conditions, and how you combine data sources.
Query rewriting is the practice of restructuring SQL to guide the optimizer toward better plans. This isn't about tricking the database—it's about expressing your intent in a form the optimizer can work with effectively.
By the end of this page, you will master: subquery transformations (correlated to uncorrelated, subquery to join); efficient patterns for existence checks; avoiding implicit conversions and non-sargable expressions; leveraging CTEs effectively; and practical rewriting techniques for common slow query patterns.
Subqueries are powerful but can be performance traps. Understanding how optimizers handle subqueries—and when manual transformation helps—is essential for query optimization.
Correlated vs. Uncorrelated Subqueries:
The critical distinction is whether the subquery references the outer query:
-- Uncorrelated: find orders above average order total
SELECT * FROM orders
WHERE total > (SELECT AVG(total) FROM orders);
-- The subquery runs once, returning a single value
-- Correlated: find each customer's most recent order
SELECT * FROM orders o1
WHERE created_at = (
SELECT MAX(created_at) FROM orders o2
WHERE o2.customer_id = o1.customer_id -- references outer o1
);
-- The subquery runs for EACH row of o1
Correlated subqueries can be extremely slow on large tables. For 1 million rows in the outer query, the correlated subquery executes 1 million times.
1234567891011121314151617181920212223242526272829303132
-- BEFORE: Correlated subquery (slow)-- Find customers with at least one order over $1000SELECT * FROM customers cWHERE EXISTS ( SELECT 1 FROM orders o WHERE o.customer_id = c.id AND o.total > 1000);-- This can be optimized automatically, but not always -- AFTER: Semi-join with DISTINCT (explicit)SELECT DISTINCT c.* FROM customers cJOIN orders o ON c.id = o.customer_idWHERE o.total > 1000; -- OR: Using EXISTS is often fine - optimizers are smart about it-- But verify with EXPLAIN that it's being decorrelated -- BEFORE: Correlated subquery for maximum (very slow)SELECT * FROM orders o1WHERE o1.total = ( SELECT MAX(o2.total) FROM orders o2 WHERE o2.customer_id = o1.customer_id); -- AFTER: Window function (much faster)SELECT * FROM ( SELECT *, ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY total DESC) as rn FROM orders) rankedWHERE rn = 1;PostgreSQL, SQL Server, and Oracle's optimizers can often 'decorrelate' subqueries automatically, transforming correlated subqueries into joins. But this isn't guaranteed—complex subqueries, certain function calls, or optimizer limitations may prevent decorrelation. Always EXPLAIN to verify.
Three constructs can express "rows that match something in another table": IN, EXISTS, and JOIN. Each has different performance characteristics depending on the scenario.
| Construct | Best For | Watch Out For |
|---|---|---|
| IN (subquery) | Small result sets; optimizer rewrites to semi-join | Large IN lists can cause plan issues; NULL handling differs from EXISTS |
| EXISTS | Existence checks; can stop at first match | Must be correlated; verify decorrelation happens |
| JOIN | When you need data from both tables; explicit control | Beware of row multiplication with 1:N relationships; may need DISTINCT |
| NOT IN | Exclusion with small sets | ANY NULL in subquery returns empty result—dangerous! |
| NOT EXISTS | Exclusion that handles NULLs correctly | Generally preferred over NOT IN for safety |
| LEFT JOIN WHERE IS NULL | Anti-join pattern; explicit exclusion | Clear semantics; good optimizer support |
12345678910111213141516171819202122232425262728293031323334
-- Scenario: Find customers who have placed orders -- Using INSELECT * FROM customers WHERE id IN (SELECT customer_id FROM orders); -- Using EXISTS SELECT * FROM customers cWHERE EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.id); -- Using JOIN with DISTINCTSELECT DISTINCT c.* FROM customers cJOIN orders o ON c.id = o.customer_id; -- All three are logically equivalent for this case-- Check EXPLAIN to see which the optimizer handles best on your data -- Scenario: Find customers who have NOT placed orders -- NOT IN (CAUTION: NULL-unsafe!)SELECT * FROM customers WHERE id NOT IN (SELECT customer_id FROM orders WHERE customer_id IS NOT NULL);-- Must exclude NULLs or this may return no rows -- NOT EXISTS (NULL-safe, preferred)SELECT * FROM customers cWHERE NOT EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.id); -- LEFT JOIN IS NULL (explicit anti-join)SELECT c.* FROM customers cLEFT JOIN orders o ON c.id = o.customer_idWHERE o.id IS NULL;If the subquery in NOT IN returns ANY null value, the entire NOT IN evaluates to unknown (not true), returning zero rows. This is a notorious bug source: WHERE id NOT IN (1, 2, NULL) matches nothing because id NOT IN NULL is always unknown. Always filter NULLs in NOT IN subqueries, or prefer NOT EXISTS.
A sargable (Search ARGument ABLE) expression is one that can use an index. Non-sargable expressions force full table scans even when relevant indexes exist. Recognizing and rewriting non-sargable patterns is a core optimization skill.
The Core Principle:
For an index on column X to be used, the column must appear alone on one side of the comparison. Any transformation applied to the column forces a full scan:
-- Index on 'amount' column
-- Sargable: column alone
WHERE amount > 1000
-- Non-sargable: function applied
WHERE ROUND(amount) = 1000
-- Non-sargable: column in expression
WHERE amount * 1.1 > 1000
-- Rewritten sargable equivalent:
WHERE amount > 1000 / 1.1
The optimizer can't rearrange expressions because it can't know if transformations are reversible or semantically equivalent for all data types.
1234567891011121314151617181920212223242526272829303132333435363738394041
-- Date functions: Non-sargable to sargable -- BAD: Function on columnWHERE DATE(created_at) = '2024-06-15' -- GOOD: Range on columnWHERE created_at >= '2024-06-15' AND created_at < '2024-06-16' -- BAD: YEAR extractionWHERE YEAR(created_at) = 2024 AND MONTH(created_at) = 6 -- GOOD: Range boundariesWHERE created_at >= '2024-06-01' AND created_at < '2024-07-01' -- String functions: Non-sargable to sargable -- BAD: Case-insensitive via functionWHERE UPPER(username) = 'JOHN' -- GOOD: Use case-insensitive collationWHERE username = 'john' COLLATE utf8mb4_general_ci -- GOOD (PostgreSQL): ILIKE operator (uses trigram index if available)WHERE username ILIKE 'john' -- GOOD: Create functional index (if query can't change)CREATE INDEX idx_upper_username ON users (UPPER(username)); -- Numeric expressions: Rearrange to isolate column -- BADWHERE price * quantity > 1000 -- GOOD (if possible to compute)WHERE price > 1000 / quantity -- Only if quantity never zero -- BETTER: Add computed column and index itALTER TABLE orders ADD total_value DECIMAL GENERATED ALWAYS AS (price * quantity);CREATE INDEX idx_total_value ON orders (total_value);Comparing a VARCHAR column to an INTEGER causes implicit conversion: WHERE user_id = 123 on VARCHAR column becomes WHERE CAST(user_id AS INT) = 123 internally. This is non-sargable. Always match types exactly in comparisons.
Implicit type conversions are silent performance killers. When the database must convert data types to compare values, it often can't use indexes—and you won't see any error indicating this happened.
1234567891011121314151617181920212223242526272829303132333435
-- Column: phone VARCHAR(20)-- Index: CREATE INDEX idx_phone ON users (phone) -- PROBLEM: Comparing VARCHAR to INTEGERSELECT * FROM users WHERE phone = 5551234567;-- Database converts phone to INTEGER for each row-- Index cannot be used; full table scan -- SOLUTION: Match typesSELECT * FROM users WHERE phone = '5551234567';-- Index is used correctly -- Column: status INTEGER-- Index: CREATE INDEX idx_status ON orders (status) -- PROBLEM: Comparing INTEGER to STRINGSELECT * FROM orders WHERE status = '1'; -- String literal-- Some databases (MySQL) handle this; others (PostgreSQL) error -- SOLUTION: Use correct typeSELECT * FROM orders WHERE status = 1; -- Integer literal -- Column: created_at TIMESTAMP WITH TIME ZONE-- Index: CREATE INDEX idx_created ON orders (created_at) -- PROBLEM: Date vs timestamp comparisonSELECT * FROM orders WHERE created_at = '2024-06-15';-- Timestamp column compared to date - may require conversion -- SOLUTION: Compare with same precisionSELECT * FROM orders WHERE created_at >= '2024-06-15'::timestamp AND created_at < '2024-06-16'::timestamp;Diagnosing Implicit Conversions:
In PostgreSQL, these appear in EXPLAIN as explicit CAST operations. In MySQL, the Extra column may show 'Range checked for each record' or converted values. SQL Server's execution plans show Type Conversion warnings.
Prevention:
Object-Relational Mappers often convert types implicitly. A Python None might become SQL NULL correctly, but an integer customer_id passed as string might silently cause full table scans. Always verify generated SQL and execution plans for ORM queries.
Common Table Expressions (CTEs) improve query readability but can have surprising performance implications. Understanding how databases execute CTEs is essential for using them effectively.
Optimization Fence Behavior (Historical):
Historically, CTEs acted as "optimization fences"—the optimizer couldn't push predicates into or out of CTEs. This caused performance problems:
-- CTE as optimization fence (old behavior)
WITH all_orders AS (
SELECT * FROM orders -- Selects ALL orders
)
SELECT * FROM all_orders WHERE customer_id = 123;
-- Without optimization, reads entire orders table
-- With optimization, pushes filter into CTE
Modern behavior:
| Database | CTE Behavior |
|---|---|
| PostgreSQL 12+ | CTEs are inlined by default (NOT MATERIALIZED); can force materialization with MATERIALIZED |
| MySQL 8.0+ | CTEs are merged/inlined like derived tables |
| SQL Server | CTEs are always inlined (never materialized unless explicitly temp table) |
| Oracle | CTEs may be materialized or inlined based on cost |
1234567891011121314151617181920212223242526
-- PostgreSQL: Controlling CTE materialization -- Default: Inlined (optimizable)WITH customer_orders AS ( SELECT * FROM orders WHERE status = 'active')SELECT * FROM customer_orders WHERE customer_id = 123;-- PostgreSQL 12+ inlines this; both filters applied together -- Force materialization (when reused multiple times)WITH customer_orders AS MATERIALIZED ( SELECT * FROM orders WHERE status = 'active')SELECT (SELECT COUNT(*) FROM customer_orders), (SELECT SUM(total) FROM customer_orders);-- CTE is computed once, results reused -- Prevent materialization (even if referenced multiple times)WITH customer_orders AS NOT MATERIALIZED ( SELECT * FROM orders WHERE status = 'active')SELECT * FROM customer_orders WHERE total > 1000UNION ALLSELECT * FROM customer_orders WHERE total < 100;-- CTE is inlined into each usage; may be faster or slowerIn modern databases, a CTE referenced once has nearly identical performance to an equivalent subquery. Choose based on readability. For repeated references, CTEs with MATERIALIZED can be faster than repeating the subquery.
UNION operations combine result sets, but the choice between UNION and UNION ALL—and how you structure compound queries—significantly affects performance.
UNION vs UNION ALL:
| Operator | Behavior | Performance |
|---|---|---|
| UNION | Removes duplicate rows | Requires sort or hash for deduplication |
| UNION ALL | Keeps all rows including duplicates | Simple concatenation, very fast |
The cost of UNION (without ALL):
For large result sets, steps 3-4 are expensive. If you know results are naturally disjoint (no duplicates possible), UNION ALL is dramatically faster.
123456789101112131415161718192021222324252627282930313233
-- BEFORE: Redundant UNION (results are naturally disjoint)SELECT * FROM orders WHERE status = 'pending'UNIONSELECT * FROM orders WHERE status = 'completed'UNIONSELECT * FROM orders WHERE status = 'cancelled';-- UNION sorts/dedupes, but status values don't overlap! -- AFTER: UNION ALL for disjoint setsSELECT * FROM orders WHERE status = 'pending'UNION ALLSELECT * FROM orders WHERE status = 'completed'UNION ALLSELECT * FROM orders WHERE status = 'cancelled';-- Simple concatenation, much faster -- BETTER: Single query with INSELECT * FROM orders WHERE status IN ('pending', 'completed', 'cancelled');-- Optimizer may even use an index range scan -- BEFORE: UNION for conditional logic (slow pattern)SELECT customer_id, 'VIP' as segment FROM customers WHERE lifetime_value > 10000UNION ALLSELECT customer_id, 'Regular' as segment FROM customers WHERE lifetime_value <= 10000;-- Scans customers table twice -- AFTER: Single scan with CASESELECT customer_id, CASE WHEN lifetime_value > 10000 THEN 'VIP' ELSE 'Regular' END as segmentFROM customers;-- Single table scanUse UNION without ALL when: (1) duplicate elimination is required for correctness, not just convenience; (2) sources may genuinely overlap; (3) the final result set is small enough that deduplication cost is acceptable. Always consider if application-level deduplication is more efficient.
OR conditions in WHERE clauses often prevent index usage because the optimizer may not be able to combine separate index accesses. Understanding how to structure OR conditions is crucial for performance.
The OR Problem:
-- Indexes: (customer_id), (email)
SELECT * FROM users
WHERE customer_id = 123 OR email = 'test@example.com';
The optimizer can use idx_customer_id to find rows matching customer_id = 123, and idx_email to find rows matching email = '...'. But these are separate access paths that must be combined.
How databases handle this:
For complex OR conditions across multiple columns, databases often fall back to full scans.
12345678910111213141516171819202122232425262728293031323334353637383940414243
-- PATTERN 1: OR across different columns-- May or may not use indexes effectively -- BEFORE: OR with separate columnsSELECT * FROM users WHERE customer_id = 123 OR email = 'test@example.com'; -- OPTION A: UNION if result sets are smallSELECT * FROM users WHERE customer_id = 123UNIONSELECT * FROM users WHERE email = 'test@example.com';-- Each query uses its optimal index -- OPTION B: Keep OR, verify with EXPLAIN that bitmap/merge is used-- Sometimes the optimizer handles it well -- PATTERN 2: OR on same column (IN is better) -- BEFORE: Multiple ORsSELECT * FROM orders WHERE status = 'pending' OR status = 'processing' OR status = 'ready'; -- AFTER: IN clauseSELECT * FROM orders WHERE status IN ('pending', 'processing', 'ready');-- Optimizer converts to range scan on index -- PATTERN 3: OR across a join condition -- BEFORE: OR in join (problematic)SELECT * FROM orders oJOIN order_items oi ON oi.order_id = o.id OR oi.original_order_id = o.id;-- Optimizer often can't use indexes on either side -- AFTER: UNION the joinsSELECT o.*, oi.* FROM orders oJOIN order_items oi ON oi.order_id = o.idUNIONSELECT o.*, oi.* FROM orders oJOIN order_items oi ON oi.original_order_id = o.id;-- Each join uses its optimal index pathBefore rewriting OR to UNION, check EXPLAIN. If you see BitmapOr (PostgreSQL) or index_merge (MySQL), the optimizer is already handling it reasonably. UNION rewrite adds overhead for combining results. Only rewrite if the current plan shows a sequential scan.
Query rewriting transforms semantically equivalent SQL into forms that execute efficiently. Let's consolidate the essential techniques:
What's Next:
With query structure optimization covered, the final page addresses Common Performance Pitfalls—the mistakes that cause production database problems and how to avoid them systematically.
You now have a comprehensive toolkit for rewriting queries to achieve better performance. These techniques complement indexing and form the core of hands-on query optimization work. Remember: measure before and after every change with EXPLAIN ANALYZE.