Query Optimization - Learning Module


Query Rewriting

The Art of Asking the Right Question

SQL is a declarative language—you describe what you want, not how to get it. But the way you phrase your request profoundly affects the optimizer's ability to find an efficient execution path.

Two semantically equivalent queries can have radically different performance: one might take 10 milliseconds while the other takes 10 minutes. The difference lies in how the query is structured—which constructs you use, how you express conditions, and how you combine data sources.

Query rewriting is the practice of restructuring SQL to guide the optimizer toward better plans. This isn't about tricking the database—it's about expressing your intent in a form the optimizer can work with effectively.

What You Will Learn

By the end of this page, you will master: subquery transformations (correlated to uncorrelated, subquery to join); efficient patterns for existence checks; avoiding implicit conversions and non-sargable expressions; leveraging CTEs effectively; and practical rewriting techniques for common slow query patterns.

Subquery Transformation Patterns

Subqueries are powerful but can be performance traps. Understanding how optimizers handle subqueries—and when manual transformation helps—is essential for query optimization.

Correlated vs. Uncorrelated Subqueries:

The critical distinction is whether the subquery references the outer query:

Uncorrelated subquery: Executes once, result is reused
Correlated subquery: Conceptually executes once per row of the outer query (unless the optimizer decorrelates it)

-- Uncorrelated: find orders above average order total
SELECT * FROM orders 
WHERE total > (SELECT AVG(total) FROM orders);
-- The subquery runs once, returning a single value

-- Correlated: find each customer's most recent order
SELECT * FROM orders o1
WHERE created_at = (
    SELECT MAX(created_at) FROM orders o2 
    WHERE o2.customer_id = o1.customer_id  -- references outer o1
);
-- The subquery runs for EACH row of o1

Correlated subqueries can be extremely slow on large tables. If the optimizer cannot decorrelate the subquery, 1 million rows in the outer query mean 1 million executions of the subquery.

Correlated Subquery to Join Transformation
-- BEFORE: Correlated EXISTS (can be slow if the optimizer does not decorrelate it)
-- Find customers with at least one order over $1000
SELECT * FROM customers c
WHERE EXISTS (
    SELECT 1 FROM orders o 
    WHERE o.customer_id = c.id AND o.total > 1000
);
-- Most optimizers turn this into a semi-join automatically, but not always
 
-- ALTERNATIVE: Explicit join with DISTINCT
-- (DISTINCT removes the duplicates produced by customers with several matching orders)
SELECT DISTINCT c.* 
FROM customers c
JOIN orders o ON c.id = o.customer_id
WHERE o.total > 1000;
 
-- EXISTS is often fine as written; verify with EXPLAIN that it runs as a semi-join
-- before rewriting, because DISTINCT adds its own sort or hash step
 
-- BEFORE: Correlated subquery for maximum (very slow when not decorrelated)
SELECT * FROM orders o1
WHERE o1.total = (
    SELECT MAX(o2.total) FROM orders o2 
    WHERE o2.customer_id = o1.customer_id
);
 
-- AFTER: Window function (single pass over orders)
-- RANK() keeps ties for the maximum, matching the query above;
-- use ROW_NUMBER() only if you want exactly one row per customer
SELECT * FROM (
    SELECT *, 
           RANK() OVER (PARTITION BY customer_id ORDER BY total DESC) AS rn
    FROM orders
) ranked
WHERE rn = 1;
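
A rewrite is only safe if it returns the same rows, so it can help to run both forms against a tiny dataset before comparing plans. The sketch below is an illustration only: it uses Python's built-in sqlite3 module (SQLite 3.25 or newer for window functions) with a made-up orders table, and it shows that a window-function rewrite matches the correlated MAX query when it uses RANK(), while ROW_NUMBER() silently drops tied rows.

```python
import sqlite3

# In-memory database with a hypothetical orders table (illustrative data only).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total INTEGER);
    INSERT INTO orders (id, customer_id, total) VALUES
        (1, 1, 100), (2, 1, 250), (3, 1, 250),   -- customer 1 has a tie at 250
        (4, 2, 80),  (5, 2, 40);
""")

# Correlated form: every order whose total equals its customer's maximum.
CORRELATED = """
    SELECT id FROM orders o1
    WHERE o1.total = (SELECT MAX(o2.total) FROM orders o2
                      WHERE o2.customer_id = o1.customer_id)
    ORDER BY id
"""

# Window-function form; {fn} is filled with RANK or ROW_NUMBER below.
WINDOW_TEMPLATE = """
    SELECT id FROM (
        SELECT id, {fn}() OVER (PARTITION BY customer_id ORDER BY total DESC) AS rn
        FROM orders
    ) ranked
    WHERE rn = 1
    ORDER BY id
"""


def ids(sql):
    """Run a query and return the first column of every row as a list."""
    return [row[0] for row in conn.execute(sql)]


correlated_ids = ids(CORRELATED)
rank_ids = ids(WINDOW_TEMPLATE.format(fn="RANK"))
row_number_ids = ids(WINDOW_TEMPLATE.format(fn="ROW_NUMBER"))

print("correlated:", correlated_ids)   # both tied orders of customer 1 are kept
print("RANK:      ", rank_ids)         # same rows as the correlated query
print("ROW_NUMBER:", row_number_ids)   # only one row per customer
```

Which of the two tied rows ROW_NUMBER() keeps is not defined without a tie-breaking column in the ORDER BY, which is one more reason to choose the function deliberately.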

Modern Optimizers Often Handle This

PostgreSQL, SQL Server, and Oracle's optimizers can often 'decorrelate' subqueries automatically, transforming correlated subqueries into joins. But this isn't guaranteed—complex subqueries, certain function calls, or optimizer limitations may prevent decorrelation. Always EXPLAIN to verify.

IN, EXISTS, and JOIN Selection

Three constructs can express "rows that match something in another table": IN, EXISTS, and JOIN. Each has different performance characteristics depending on the scenario.

IN vs EXISTS vs JOIN Comparison
Construct	Best For	Watch Out For
IN (subquery)	Small result sets; optimizer usually rewrites to a semi-join	Very long literal IN lists can hurt planning; NULL handling differs from EXISTS
EXISTS	Existence checks; can stop at first match	Usually correlated; verify in EXPLAIN that it runs as a semi-join
JOIN	When you need data from both tables; explicit control	Beware of row multiplication with 1:N relationships; may need DISTINCT
NOT IN	Exclusion against small, NULL-free sets	Any NULL in the subquery makes the result empty. Dangerous!
NOT EXISTS	Exclusion that handles NULLs correctly	Generally preferred over NOT IN for safety
LEFT JOIN ... WHERE ... IS NULL	Anti-join pattern; explicit exclusion	Test a column that is never NULL in matched rows (e.g. the primary key); optimizer support is generally good

Equivalent Patterns
-- Scenario: Find customers who have placed orders
 
-- Using IN
SELECT * FROM customers 
WHERE id IN (SELECT customer_id FROM orders);
 
-- Using EXISTS  
SELECT * FROM customers c
WHERE EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.id);
 
-- Using JOIN with DISTINCT
SELECT DISTINCT c.* 
FROM customers c
JOIN orders o ON c.id = o.customer_id;
 
-- All three are logically equivalent for this case
-- Check EXPLAIN to see which the optimizer handles best on your data
 
-- Scenario: Find customers who have NOT placed orders
 
-- NOT IN (CAUTION: NULL-unsafe!)
SELECT * FROM customers 
WHERE id NOT IN (SELECT customer_id FROM orders WHERE customer_id IS NOT NULL);
-- Must exclude NULLs or this may return no rows
 
-- NOT EXISTS (NULL-safe, preferred)
SELECT * FROM customers c
WHERE NOT EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.id);
 
-- LEFT JOIN IS NULL (explicit anti-join)
SELECT c.* 
FROM customers c
LEFT JOIN orders o ON c.id = o.customer_id
WHERE o.id IS NULL;

The NOT IN Null Trap

If the subquery in NOT IN returns any NULL value, the NOT IN predicate can never evaluate to true: it is false for values that appear in the list and unknown for all others, so the query returns zero rows. This is a notorious bug source: WHERE id NOT IN (1, 2, NULL) matches nothing because the comparison id <> NULL is always unknown. Always filter NULLs in NOT IN subqueries, or prefer NOT EXISTS.
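
The trap is easy to reproduce. The following sketch uses Python's sqlite3 module with hypothetical customers and orders tables, where one order has a NULL customer_id. It compares the unfiltered NOT IN with the three safe alternatives; other engines follow the same three-valued logic, but treat this as a demonstration rather than a benchmark.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER);
    INSERT INTO customers VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
    INSERT INTO orders VALUES (10, 1), (11, NULL);   -- one order has no customer
""")


def names(sql):
    """Return the customer names produced by a query, in id order."""
    return [row[0] for row in conn.execute(sql)]


# NULL in the subquery result: the predicate is never true, so no rows come back.
not_in_unsafe = names("""
    SELECT name FROM customers
    WHERE id NOT IN (SELECT customer_id FROM orders)
    ORDER BY id
""")

# Filtering NULLs restores the intended meaning.
not_in_filtered = names("""
    SELECT name FROM customers
    WHERE id NOT IN (SELECT customer_id FROM orders WHERE customer_id IS NOT NULL)
    ORDER BY id
""")

# NOT EXISTS is unaffected by NULLs in orders.customer_id.
not_exists = names("""
    SELECT name FROM customers c
    WHERE NOT EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.id)
    ORDER BY c.id
""")

# LEFT JOIN anti-join, testing the primary key of the right-hand table.
left_join = names("""
    SELECT c.name FROM customers c
    LEFT JOIN orders o ON c.id = o.customer_id
    WHERE o.id IS NULL
    ORDER BY c.id
""")

print(not_in_unsafe)    # []
print(not_in_filtered)  # ['Bob', 'Cy']
print(not_exists)       # ['Bob', 'Cy']
print(left_join)        # ['Bob', 'Cy']
```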

Making Expressions Sargable

A sargable (Search ARGument ABLE) expression is one that lets the database use an index to locate matching rows. Non-sargable expressions typically force a full table or index scan even when relevant indexes exist. Recognizing and rewriting non-sargable patterns is a core optimization skill.

Non-Sargable (Index Unused)

•WHERE YEAR(created_at) = 2024
•WHERE amount + 50 > 1000
•WHERE UPPER(email) = 'TEST'
•WHERE name LIKE '%smith'
•WHERE id / 100 = 5
•WHERE status != 'deleted'

Sargable (Index Used)

•WHERE created_at >= '2024-01-01' AND created_at < '2025-01-01'
•WHERE amount > 950
•WHERE email = 'test' (with CI collation or functional index)
•WHERE name LIKE 'smith%' (a prefix match, which is a different predicate from the suffix match above)
•WHERE id >= 500 AND id < 600 (assumes integer division and non-negative id)
•WHERE status IN ('active', 'pending', ...)

The Core Principle:

For an index on column X to be used for a seek, the column must appear alone on one side of the comparison. Wrapping the column in a function or arithmetic generally forces a scan, unless a matching expression index exists:

-- Index on 'amount' column

-- Sargable: column alone
WHERE amount > 1000

-- Non-sargable: function applied
WHERE ROUND(amount) = 1000

-- Non-sargable: column in expression
WHERE amount * 1.1 > 1000

-- Rewritten sargable equivalent:
WHERE amount > 1000 / 1.1

Most optimizers will not rearrange such expressions for you, because the rewrite is not always safe: rounding, overflow, division by zero, and data-type rules can make the two forms return different rows.
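
The effect of isolating the column can be observed directly in a query plan. As a rough illustration, the sketch below asks SQLite (through Python's sqlite3 module) for the plan of both forms against a hypothetical orders table with an index on amount. SQLite reports SCAN when it walks every entry and SEARCH when it seeks into the index; other engines use different plan vocabulary, so read this as a demonstration of the principle only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL);
    CREATE INDEX idx_orders_amount ON orders (amount);
""")


def plan(sql):
    """Return SQLite's plan for a query as one string (detail is the last column)."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)


# Column wrapped in arithmetic: the index cannot be used to seek.
wrapped_plan = plan("SELECT id FROM orders WHERE amount * 1.1 > 1000")

# Column alone on one side: the constant bound can drive an index seek.
isolated_plan = plan("SELECT id FROM orders WHERE amount > 1000 / 1.1")

print("wrapped: ", wrapped_plan)
print("isolated:", isolated_plan)
```

The exact wording of the plan lines varies between SQLite versions, which is why the sketch only prints them rather than depending on a fixed string.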

Common Sargable Rewrites
-- Date functions: Non-sargable to sargable
 
-- BAD: Function on column
WHERE DATE(created_at) = '2024-06-15'
 
-- GOOD: Range on column
WHERE created_at >= '2024-06-15' AND created_at < '2024-06-16'
 
-- BAD: YEAR extraction
WHERE YEAR(created_at) = 2024 AND MONTH(created_at) = 6
 
-- GOOD: Range boundaries
WHERE created_at >= '2024-06-01' AND created_at < '2024-07-01'
 
 
-- String functions: Non-sargable to sargable
 
-- BAD: Case-insensitive via function
WHERE UPPER(username) = 'JOHN'
 
-- GOOD (MySQL): Declare the column with a case-insensitive collation
-- (e.g. utf8mb4_general_ci), then compare directly
WHERE username = 'john'
 
-- CAUTION (PostgreSQL): ILIKE cannot use a plain B-tree index;
-- it needs a trigram (pg_trgm) index to avoid a scan
WHERE username ILIKE 'john'
 
-- GOOD: Create functional index (if query can't change)
CREATE INDEX idx_upper_username ON users (UPPER(username));
-- MySQL 8.0.13+ requires an extra pair of parentheses: ((UPPER(username)))
 
 
-- Numeric expressions: Rearrange to isolate column
 
-- BAD
WHERE price * quantity > 1000
 
-- NOT A FIX: the bound still changes per row, so an index on price cannot be
-- used for a seek (and it is wrong when quantity is zero or negative)
WHERE price > 1000 / quantity
 
-- BETTER: Add computed column and index it
ALTER TABLE orders ADD COLUMN total_value DECIMAL(12,2)
    GENERATED ALWAYS AS (price * quantity) STORED;
CREATE INDEX idx_total_value ON orders (total_value);
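
When the query text cannot change, an index on the expression itself is the usual escape hatch. The sketch below is a minimal illustration using SQLite's expression indexes through Python's sqlite3 module; the users table and its rows are invented for the example, and the syntax for expression indexes differs slightly across engines. It captures the plan for the same UPPER(username) query before and after the index is created.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, username TEXT);
    INSERT INTO users (username) VALUES ('John'), ('alice');
""")

QUERY = "SELECT id FROM users WHERE UPPER(username) = 'JOHN'"


def plan(sql):
    """Return SQLite's plan for a query as one string (detail is the last column)."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)


# Without an expression index the function must be evaluated for every row.
plan_before = plan(QUERY)

# Index the exact expression the query uses.
conn.execute("CREATE INDEX idx_upper_username ON users (UPPER(username))")
plan_after = plan(QUERY)

matches = conn.execute(QUERY).fetchall()

print("before:", plan_before)
print("after: ", plan_after)
print("rows:  ", matches)   # [(1,)]
```

The index only helps queries that use the same expression, so it trades write cost and storage for leaving the query untouched.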

Implicit Type Conversion Kills Sargability

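Comparing a VARCHAR column to an INTEGER causes implicit conversion in engines that allow it (such as MySQL and SQL Server): WHERE user_id = 123 on a VARCHAR column is evaluated roughly as WHERE CAST(user_id AS INT) = 123. This is non-sargable. PostgreSQL rejects the comparison with an error instead. Always match types exactly in comparisons.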

Avoiding Implicit Conversions

Implicit type conversions are silent performance killers. When the database must convert data types to compare values, it often can't use indexes—and you won't see any error indicating this happened.

Implicit Conversion Examples
-- Column: phone VARCHAR(20)
-- Index: CREATE INDEX idx_phone ON users (phone)
 
-- PROBLEM: Comparing VARCHAR to a numeric literal
SELECT * FROM users WHERE phone = 5551234567;
-- MySQL and SQL Server convert phone to a number for each row
-- Index cannot be used for a seek; full scan
-- (PostgreSQL raises an error instead: there is no varchar = bigint operator)
 
-- SOLUTION: Match types
SELECT * FROM users WHERE phone = '5551234567';
-- Index is used correctly
 
 
-- Column: status INTEGER
-- Index: CREATE INDEX idx_status ON orders (status)
 
-- RISKY: Comparing INTEGER to a string literal
SELECT * FROM orders WHERE status = '1';  -- String literal
-- MySQL and PostgreSQL convert the literal once, so the index is still usable,
-- but a bound parameter explicitly typed as text fails in PostgreSQL
 
-- SOLUTION: Use correct type
SELECT * FROM orders WHERE status = 1;  -- Integer literal
 
 
-- Column: created_at TIMESTAMP WITH TIME ZONE
-- Index: CREATE INDEX idx_created ON orders (created_at)
 
-- PROBLEM: Date vs timestamp comparison
SELECT * FROM orders WHERE created_at = '2024-06-15';
-- The literal becomes midnight of that day, so only rows stamped
-- exactly 00:00:00 match
 
-- SOLUTION: Half-open range covering the whole day, cast to the column's type
SELECT * FROM orders 
WHERE created_at >= '2024-06-15'::timestamptz 
  AND created_at < '2024-06-16'::timestamptz;

Common Implicit Conversion Mistakes

•UUID as string — Comparing UUID column to string representation without casting
•Numeric string columns — Storing numbers as VARCHAR but comparing with integer literals
•Date vs timestamp — Column is TIMESTAMP, comparing with DATE literal
•Collation mismatches — Joining tables with different string collations
•Nullable comparisons — Comparing with = NULL instead of IS NULL
•Character set differences — UTF8 vs Latin1 in MySQL requiring conversion

Diagnosing Implicit Conversions:

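In PostgreSQL, EXPLAIN shows the cast applied to the column inside the Filter condition, for example (status)::text. In MySQL, running SHOW WARNINGS after EXPLAIN may report that an index cannot be used because of a type or collation conversion. SQL Server's execution plans show CONVERT_IMPLICIT together with a type conversion warning.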

Prevention:

Define column types to match how they'll be queried
Review application code to ensure parameter binding uses correct types
Use parameterized queries with explicit type casting
Create coding standards requiring explicit casts when types differ

ORMs Can Hide Type Mismatches

Object-Relational Mappers often convert types implicitly. A Python None might become SQL NULL correctly, but an integer customer_id passed as string might silently cause full table scans. Always verify generated SQL and execution plans for ORM queries.

Effective Use of CTEs

Common Table Expressions (CTEs) improve query readability but can have surprising performance implications. Understanding how databases execute CTEs is essential for using them effectively.

Optimization Fence Behavior (Historical):

Historically (for example, in PostgreSQL before version 12), CTEs acted as "optimization fences": the optimizer couldn't push predicates into or out of CTEs. This caused performance problems:

-- CTE as optimization fence (old behavior)
WITH all_orders AS (
    SELECT * FROM orders  -- Selects ALL orders
)
SELECT * FROM all_orders WHERE customer_id = 123;
-- Without optimization, reads entire orders table
-- With optimization, pushes filter into CTE

Modern behavior:

Database	CTE Behavior
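PostgreSQL 12+	Non-recursive, side-effect-free CTEs referenced once are inlined by default; CTEs referenced more than once are materialized unless marked NOT MATERIALIZED; MATERIALIZED forces materialization
MySQL 8.0+	CTEs are merged or materialized by the optimizer, handled like derived tables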
SQL Server	CTEs are always inlined (never materialized unless explicitly temp table)
Oracle	CTEs may be materialized or inlined based on cost

CTE Optimization Control
-- PostgreSQL: Controlling CTE materialization
 
-- Default: Inlined (optimizable)
WITH customer_orders AS (
    SELECT * FROM orders WHERE status = 'active'
)
SELECT * FROM customer_orders WHERE customer_id = 123;
-- PostgreSQL 12+ inlines this; both filters applied together
 
-- Force materialization (when reused multiple times)
WITH customer_orders AS MATERIALIZED (
    SELECT * FROM orders WHERE status = 'active'
)
SELECT 
    (SELECT COUNT(*) FROM customer_orders),
    (SELECT SUM(total) FROM customer_orders);
-- CTE is computed once, results reused
 
-- Prevent materialization (even if referenced multiple times)
WITH customer_orders AS NOT MATERIALIZED (
    SELECT * FROM orders WHERE status = 'active'
)
SELECT * FROM customer_orders WHERE total > 1000
UNION ALL
SELECT * FROM customer_orders WHERE total < 100;
-- CTE is inlined into each usage; may be faster or slower

When to Use CTEs

•Readability — Break complex queries into understandable chunks. Modern CTEs have minimal performance cost.
•Recursive queries — CTEs are the standard way to express recursive patterns (tree traversal, graph walking).
•Reuse in single query — When the same intermediate result is needed multiple times, MATERIALIZED can help.
•INSERT/UPDATE/DELETE with RETURNING — CTEs enable complex data modification workflows.
•Window function results — Filter on window function results by wrapping in CTE.
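
For the recursive case, a small tree traversal shows the shape such a query usually takes. This is a sketch under stated assumptions: the categories table, its parent_id column, and the sample rows are hypothetical, and the query is run on SQLite via Python's sqlite3 module, although WITH RECURSIVE has the same structure in PostgreSQL and MySQL 8.0+.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE categories (id INTEGER PRIMARY KEY, parent_id INTEGER, name TEXT);
    INSERT INTO categories VALUES
        (1, NULL, 'root'),
        (2, 1, 'books'),
        (3, 2, 'fiction'),
        (4, 1, 'music'),
        (5, 3, 'sci-fi');
""")

# Anchor member selects the starting node; the recursive member repeatedly
# joins children onto the rows found so far, tracking depth along the way.
SUBTREE = """
    WITH RECURSIVE subtree(id, name, depth) AS (
        SELECT id, name, 0 FROM categories WHERE id = ?
        UNION ALL
        SELECT c.id, c.name, s.depth + 1
        FROM categories c
        JOIN subtree s ON c.parent_id = s.id
    )
    SELECT name, depth FROM subtree ORDER BY depth, id
"""

books_subtree = conn.execute(SUBTREE, (2,)).fetchall()
print(books_subtree)   # [('books', 0), ('fiction', 1), ('sci-fi', 2)]
```

UNION ALL is the usual choice in the recursive member for a tree; a graph with cycles would need UNION or an explicit visited check to terminate.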

Subquery vs CTE for Derived Tables

In modern databases, a CTE referenced once has nearly identical performance to an equivalent subquery. Choose based on readability. For repeated references, CTEs with MATERIALIZED can be faster than repeating the subquery.

UNION Optimization

UNION operations combine result sets, but the choice between UNION and UNION ALL—and how you structure compound queries—significantly affects performance.

UNION vs UNION ALL:

Operator	Behavior	Performance
UNION	Removes duplicate rows	Requires sort or hash for deduplication
UNION ALL	Keeps all rows including duplicates	Simple concatenation, very fast

The cost of UNION (without ALL):

1. Execute both queries
2. Combine results into a temp structure
3. Sort or hash entire result set to find duplicates
4. Remove duplicates
5. Return deduplicated results

For large result sets, steps 3-4 are expensive. If you know results are naturally disjoint (no duplicates possible), UNION ALL is dramatically faster.
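
The semantic difference is worth seeing once before relying on it. The sketch below uses two invented tables with one overlapping email address, run on SQLite through Python's sqlite3 module. It shows only what each operator returns; it does not measure the cost of deduplication.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE newsletter (email TEXT);
    CREATE TABLE buyers (email TEXT);
    INSERT INTO newsletter VALUES ('a@x.test'), ('b@x.test');
    INSERT INTO buyers VALUES ('b@x.test'), ('c@x.test');   -- b@x.test is in both
""")

# UNION removes the duplicate that appears in both sources.
union_rows = sorted(
    row[0] for row in conn.execute(
        "SELECT email FROM newsletter UNION SELECT email FROM buyers"
    )
)

# UNION ALL simply concatenates the two result sets.
union_all_rows = sorted(
    row[0] for row in conn.execute(
        "SELECT email FROM newsletter UNION ALL SELECT email FROM buyers"
    )
)

print(union_rows)       # ['a@x.test', 'b@x.test', 'c@x.test']
print(union_all_rows)   # ['a@x.test', 'b@x.test', 'b@x.test', 'c@x.test']
```

If the two lists are identical for your data by construction, the deduplication step is pure overhead and UNION ALL is the safe substitute.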

UNION Optimization Patterns
-- BEFORE: Redundant UNION (results are naturally disjoint)
SELECT * FROM orders WHERE status = 'pending'
UNION
SELECT * FROM orders WHERE status = 'completed'
UNION
SELECT * FROM orders WHERE status = 'cancelled';
-- UNION sorts/dedupes, but status values don't overlap!
 
-- AFTER: UNION ALL for disjoint sets
SELECT * FROM orders WHERE status = 'pending'
UNION ALL
SELECT * FROM orders WHERE status = 'completed'
UNION ALL
SELECT * FROM orders WHERE status = 'cancelled';
-- Simple concatenation, much faster
 
-- BETTER: Single query with IN
SELECT * FROM orders WHERE status IN ('pending', 'completed', 'cancelled');
-- Optimizer may even use an index range scan
 
 
-- BEFORE: UNION ALL for conditional logic (slow pattern)
SELECT customer_id, 'VIP' as segment FROM customers WHERE lifetime_value > 10000
UNION ALL
SELECT customer_id, 'Regular' as segment FROM customers WHERE lifetime_value <= 10000;
-- Scans customers table twice
 
-- AFTER: Single scan with CASE
SELECT 
    customer_id,
    CASE WHEN lifetime_value > 10000 THEN 'VIP' ELSE 'Regular' END as segment
FROM customers
WHERE lifetime_value IS NOT NULL;  -- the BEFORE query skips NULLs; keep that behavior
-- Single table scan
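
One detail to check when collapsing branches into a CASE is NULL handling: a row whose lifetime_value is NULL satisfies neither > 10000 nor <= 10000, so the two-branch query drops it, while an ELSE branch picks it up. The sketch below illustrates this on SQLite via Python's sqlite3 module, with a hypothetical three-row customers table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, lifetime_value INTEGER);
    INSERT INTO customers VALUES (1, 20000), (2, 500), (3, NULL);
""")

# Two-branch form: the NULL row matches neither predicate and is omitted.
two_scans = conn.execute("""
    SELECT customer_id, 'VIP' AS segment FROM customers WHERE lifetime_value > 10000
    UNION ALL
    SELECT customer_id, 'Regular' AS segment FROM customers WHERE lifetime_value <= 10000
    ORDER BY customer_id
""").fetchall()

# CASE with ELSE and no filter: the NULL row is labelled 'Regular'.
case_unfiltered = conn.execute("""
    SELECT customer_id,
           CASE WHEN lifetime_value > 10000 THEN 'VIP' ELSE 'Regular' END AS segment
    FROM customers
    ORDER BY customer_id
""").fetchall()

# CASE plus an explicit NULL filter: same rows as the two-branch form.
case_filtered = conn.execute("""
    SELECT customer_id,
           CASE WHEN lifetime_value > 10000 THEN 'VIP' ELSE 'Regular' END AS segment
    FROM customers
    WHERE lifetime_value IS NOT NULL
    ORDER BY customer_id
""").fetchall()

print(two_scans)        # [(1, 'VIP'), (2, 'Regular')]
print(case_unfiltered)  # [(1, 'VIP'), (2, 'Regular'), (3, 'Regular')]
print(case_filtered)    # [(1, 'VIP'), (2, 'Regular')]
```

Whether the NULL row should appear is a business decision; the point is to make it explicitly rather than let the rewrite change it.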

When UNION (Not UNION ALL) Is Required

Use UNION without ALL when: (1) duplicate elimination is required for correctness, not just convenience; (2) sources may genuinely overlap; (3) the final result set is small enough that deduplication cost is acceptable. Always consider if application-level deduplication is more efficient.

OR Condition Optimization

OR conditions in WHERE clauses often prevent index usage because the optimizer may not be able to combine separate index accesses. Understanding how to structure OR conditions is crucial for performance.

The OR Problem:

-- Indexes: (customer_id), (email)

SELECT * FROM users 
WHERE customer_id = 123 OR email = 'test@example.com';

The optimizer can use idx_customer_id to find rows matching customer_id = 123, and idx_email to find rows matching email = '...'. But these are separate access paths that must be combined.

How databases handle this:

Bitmap OR (PostgreSQL): Perform two index scans, create bitmaps, OR them together, fetch matching rows
Index Merge (MySQL): Similar—scan both indexes, merge results
Neither: Full table scan because combining is deemed more expensive

For complex OR conditions across multiple columns, databases often fall back to full scans.

OR Optimization Strategies
-- PATTERN 1: OR across different columns
-- May or may not use indexes effectively
 
-- BEFORE: OR with separate columns
SELECT * FROM users 
WHERE customer_id = 123 OR email = 'test@example.com';
 
-- OPTION A: UNION if result sets are small
SELECT * FROM users WHERE customer_id = 123
UNION
SELECT * FROM users WHERE email = 'test@example.com';
-- Each query uses its optimal index
 
-- OPTION B: Keep OR, verify with EXPLAIN that bitmap/merge is used
-- Sometimes the optimizer handles it well
 
 
-- PATTERN 2: OR on same column (IN is better)
 
-- BEFORE: Multiple ORs
SELECT * FROM orders 
WHERE status = 'pending' OR status = 'processing' OR status = 'ready';
 
-- AFTER: IN clause
SELECT * FROM orders 
WHERE status IN ('pending', 'processing', 'ready');
-- Optimizer converts to range scan on index
 
 
-- PATTERN 3: OR across a join condition
 
-- BEFORE: OR in join (problematic)
SELECT * FROM orders o
JOIN order_items oi ON oi.order_id = o.id OR oi.original_order_id = o.id;
-- Optimizer often can't use indexes on either side
 
-- AFTER: UNION the joins
SELECT o.*, oi.* FROM orders o
JOIN order_items oi ON oi.order_id = o.id
UNION
SELECT o.*, oi.* FROM orders o
JOIN order_items oi ON oi.original_order_id = o.id;
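-- Each join uses its optimal index path; UNION (not UNION ALL) drops the
-- duplicate produced when both conditions match the same pair of rows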

Check EXPLAIN for Bitmap/Index Merge

Before rewriting OR to UNION, check EXPLAIN. If you see BitmapOr (PostgreSQL) or index_merge (MySQL), the optimizer is already handling it reasonably. UNION rewrite adds overhead for combining results. Only rewrite if the current plan shows a sequential scan.
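
A workflow for this check might look like the sketch below: print the plan for the OR query, and confirm that the candidate UNION rewrite returns the same rows before adopting it. It runs on SQLite via Python's sqlite3 module with an invented users table; SQLite's plan text differs from PostgreSQL's BitmapOr and MySQL's index_merge, so the plan is printed for inspection only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, customer_id INTEGER, email TEXT);
    CREATE INDEX idx_customer_id ON users (customer_id);
    CREATE INDEX idx_email ON users (email);
    INSERT INTO users VALUES
        (1, 123, 'test@example.com'),    -- matches both predicates
        (2, 123, 'other@example.com'),
        (3, 456, 'test@example.com'),
        (4, 789, 'nobody@example.com');
""")

OR_QUERY = """
    SELECT id FROM users
    WHERE customer_id = 123 OR email = 'test@example.com'
    ORDER BY id
"""

UNION_QUERY = """
    SELECT id FROM users WHERE customer_id = 123
    UNION
    SELECT id FROM users WHERE email = 'test@example.com'
    ORDER BY id
"""

UNION_ALL_QUERY = UNION_QUERY.replace("UNION", "UNION ALL")


def ids(sql):
    """Run a query and return the first column of every row as a list."""
    return [row[0] for row in conn.execute(sql)]


# Step 1: look at the plan of the original OR query.
or_plan = " | ".join(
    row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + OR_QUERY)
)
print("OR plan:", or_plan)

# Step 2: confirm the rewrite is equivalent on representative data.
or_ids = ids(OR_QUERY)
union_ids = ids(UNION_QUERY)
union_all_ids = ids(UNION_ALL_QUERY)

print(or_ids)         # [1, 2, 3]
print(union_ids)      # [1, 2, 3]
print(union_all_ids)  # [1, 1, 2, 3]  (row 1 matches both branches)
```

The UNION ALL result shows why the rewrite needs deduplicating UNION here: a row that satisfies both predicates would otherwise appear twice.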

Summary: Query Rewriting Mastery

Query rewriting transforms semantically equivalent SQL into forms that execute efficiently. Let's consolidate the essential techniques:

Key Takeaways

•Decorrelate subqueries — Transform correlated subqueries to joins or window functions; when the optimizer cannot decorrelate them, the subquery runs once per outer row.
•Prefer EXISTS for existence checks — EXISTS can stop at first match; NOT EXISTS handles NULLs safely unlike NOT IN.
•Keep expressions sargable — Isolate columns from functions and expressions so indexes can be used. Create functional indexes when queries can't change.
•Match types exactly — Implicit conversions silently prevent index usage. Always compare like types.
•Use UNION ALL when duplicates impossible — UNION without ALL requires expensive deduplication; skip it when results are naturally disjoint.
•CTEs are now optimizable — Modern databases inline CTEs by default. Use MATERIALIZED only when reuse is beneficial.
•Handle OR conditions carefully — OR may prevent index usage. Consider UNION or IN rewrites, but verify with EXPLAIN first.
•Always EXPLAIN before and after — Rewriting should improve the execution plan. If it doesn't, keep the more readable version.

What's Next:

With query structure optimization covered, the final page addresses Common Performance Pitfalls—the mistakes that cause production database problems and how to avoid them systematically.

Page Complete

You now have a comprehensive toolkit for rewriting queries to achieve better performance. These techniques complement indexing and form the core of hands-on query optimization work. Remember: measure before and after every change with EXPLAIN ANALYZE.