Sometimes the best optimization is to write a different query. Query rewriting transforms SQL statements to achieve identical results with dramatically better performance—changing how you ask the question without changing what you're asking.
The optimizer can only choose among execution plans for the query you wrote. It cannot rewrite your query into a fundamentally different structure. That's your job. A correlated subquery executing 100,000 times might become a single join. A DISTINCT on a large result set might become an EXISTS check. A function in WHERE might become a join to a reference table.
This page establishes systematic rewriting techniques that address common performance anti-patterns, transforming slow queries into efficient ones through structural changes.
By the end of this page, you will understand how to identify rewriting opportunities from execution plans and statistics, apply proven transformation patterns for subqueries, joins, aggregations, and predicates, and verify that rewrites preserve correctness while improving performance.
Not every slow query needs rewriting. Before restructuring SQL, consider whether simpler solutions exist:
Try These First: refreshed statistics, a new or adjusted index, or an invalidated stale plan. These help the optimizer without touching the SQL.
Rewrite When: the structure of the query itself is the bottleneck: correlated subqueries, non-SARGable predicates, redundant joins, or patterns the optimizer cannot transform on its own.
Query rewriting must preserve correctness. A fast query that returns wrong results is worse than a slow query that returns right results. Always verify rewrites against the original using representative test data before deployment.
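This verification step can be automated. A minimal sketch using Python's built-in sqlite3 module, with a toy schema standing in for real tables: run both the original and the rewritten query, then compare the result sets as order-independent collections.

```python
import sqlite3

# Toy stand-in data; real verification should use representative test data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER, customer_name TEXT);
    CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, order_date TEXT);
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Bob'), (3, 'Cy');
    INSERT INTO orders VALUES (10, 1, '2024-01-05'), (11, 1, '2024-03-01'),
                              (12, 2, '2024-02-10');
""")

original = """
    SELECT c.customer_id,
           (SELECT MAX(o.order_date) FROM orders o
             WHERE o.customer_id = c.customer_id) AS last_order
    FROM customers c
"""
rewritten = """
    SELECT c.customer_id, MAX(o.order_date) AS last_order
    FROM customers c
    LEFT JOIN orders o ON c.customer_id = o.customer_id
    GROUP BY c.customer_id
"""

def rows(sql):
    # Sort so row order (which SQL does not guarantee) cannot cause
    # a false mismatch between the two result sets.
    return sorted(conn.execute(sql).fetchall(), key=repr)

assert rows(original) == rows(rewritten), "rewrite changed results!"
print(rows(original))  # [(1, '2024-03-01'), (2, '2024-02-10'), (3, None)]
```

Note that customer 3, who has no orders, appears with a NULL in both versions; a rewrite that used INNER JOIN instead of LEFT JOIN would drop that row and fail this check.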
Subqueries are intuitive to write but often execute inefficiently. Many subqueries can be transformed into joins, which typically perform better because the optimizer has more flexibility in choosing execution order and methods.
Pattern 1: Correlated Subquery → Join
```sql
-- ================================================================
-- Subquery to Join Transformations
-- ================================================================

-- =========================
-- Pattern 1: Correlated Subquery → LEFT JOIN
-- =========================

-- BEFORE: Correlated subquery (executes once per customer)
SELECT
    c.customer_id,
    c.customer_name,
    (SELECT MAX(o.order_date)
     FROM orders o
     WHERE o.customer_id = c.customer_id) AS last_order_date
FROM customers c;

/*
Execution: For each of 100,000 customers, runs a separate
subquery against the orders table. If orders has 1M rows,
this could be 100,000 index seeks or worse.
*/

-- AFTER: LEFT JOIN with aggregation (single pass)
SELECT
    c.customer_id,
    c.customer_name,
    MAX(o.order_date) AS last_order_date
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_id, c.customer_name;

/*
Execution: Single hash join or merge join between customers
and orders, then aggregate. Much more efficient for set operations.
*/

-- =========================
-- Pattern 2: IN Subquery → INNER JOIN
-- =========================

-- BEFORE: IN with subquery
SELECT *
FROM products p
WHERE p.product_id IN (
    SELECT DISTINCT oi.product_id
    FROM order_items oi
    WHERE oi.quantity > 100
);

-- AFTER: JOIN (eliminates need for DISTINCT processing)
SELECT DISTINCT p.*
FROM products p
INNER JOIN order_items oi ON p.product_id = oi.product_id
WHERE oi.quantity > 100;

-- Or better, using EXISTS (often fastest):
SELECT p.*
FROM products p
WHERE EXISTS (
    SELECT 1
    FROM order_items oi
    WHERE oi.product_id = p.product_id
      AND oi.quantity > 100
);

/*
EXISTS vs IN vs JOIN comparison:
- IN with subquery: May materialize subquery results
- JOIN: May produce duplicates requiring DISTINCT
- EXISTS: Stops at first match, no duplicate handling needed

For "is there any matching row?" questions, EXISTS is typically optimal.
*/

-- =========================
-- Pattern 3: NOT IN → LEFT JOIN with NULL check
-- =========================

-- BEFORE: NOT IN (DANGEROUS with NULLable columns!)
SELECT c.*
FROM customers c
WHERE c.customer_id NOT IN (
    SELECT o.customer_id
    FROM orders o
    WHERE o.status = 'shipped'
);

/*
WARNING: If orders.customer_id contains ANY NULL value,
NOT IN returns zero rows! This is SQL semantics, not a bug.
Also, NOT IN often prevents efficient anti-join execution.
*/

-- AFTER: LEFT JOIN with NULL check (correct and efficient)
SELECT c.*
FROM customers c
LEFT JOIN orders o
    ON c.customer_id = o.customer_id
   AND o.status = 'shipped'
WHERE o.customer_id IS NULL;

/*
Execution: Hash anti-join or merge anti-join.
Correct behavior regardless of NULL values.
Clear intent: find customers with no matching shipped orders.
*/

-- Alternative: NOT EXISTS (often preferred)
SELECT c.*
FROM customers c
WHERE NOT EXISTS (
    SELECT 1
    FROM orders o
    WHERE o.customer_id = c.customer_id
      AND o.status = 'shipped'
);

/*
NOT EXISTS handles NULLs correctly and often generates
optimal anti-join plans.
Most readable for "no matching rows" semantics.
*/
```

Pattern 4: Scalar Subquery in SELECT → JOIN
Scalar subqueries in the SELECT list execute once per output row. For large result sets, this creates severe performance problems.
```sql
-- =========================
-- Pattern 4: Scalar Subquery in SELECT → JOIN
-- =========================

-- BEFORE: Multiple scalar subqueries (very slow!)
SELECT
    o.order_id,
    o.order_date,
    (SELECT c.customer_name FROM customers c
     WHERE c.customer_id = o.customer_id) AS customer_name,
    (SELECT COUNT(*) FROM order_items oi
     WHERE oi.order_id = o.order_id) AS item_count,
    (SELECT SUM(oi.quantity * oi.unit_price) FROM order_items oi
     WHERE oi.order_id = o.order_id) AS order_total
FROM orders o
WHERE o.order_date >= '2024-01-01';

/*
For 50,000 orders, this executes:
- 50,000 lookups to customers
- 50,000 COUNT queries against order_items
- 50,000 SUM queries against order_items
= 150,000 subquery executions!
*/

-- AFTER: JOINs with aggregation (3 scans maximum)
SELECT
    o.order_id,
    o.order_date,
    c.customer_name,
    oi_stats.item_count,
    oi_stats.order_total
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
JOIN (
    SELECT
        order_id,
        COUNT(*) AS item_count,
        SUM(quantity * unit_price) AS order_total
    FROM order_items
    GROUP BY order_id
) oi_stats ON o.order_id = oi_stats.order_id
WHERE o.order_date >= '2024-01-01';

/*
Execution:
- Single scan of orders (filtered)
- Single scan of customers (hash/merge joined)
- Single scan of order_items (grouped, then joined)
- Much more efficient set-based processing
*/

-- =========================
-- CTE Approach for Clarity
-- =========================

WITH order_item_stats AS (
    SELECT
        order_id,
        COUNT(*) AS item_count,
        SUM(quantity * unit_price) AS order_total
    FROM order_items
    GROUP BY order_id
)
SELECT
    o.order_id,
    o.order_date,
    c.customer_name,
    ois.item_count,
    ois.order_total
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
JOIN order_item_stats ois ON o.order_id = ois.order_id
WHERE o.order_date >= '2024-01-01';

/*
CTEs improve readability without runtime cost (in most cases).
The optimizer typically inlines non-recursive CTEs.
*/
```

Modern optimizers can automatically transform some subqueries to joins (subquery unnesting/decorrelation).
However, this doesn't work for all patterns. When in doubt, write the efficient form explicitly. Check EXPLAIN to verify the optimizer's choice.
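The NOT IN trap from Pattern 3 is easy to reproduce on any engine. A small sqlite3 demonstration (toy schema) showing a single NULL in the subquery result silencing NOT IN entirely, while NOT EXISTS returns the expected anti-join rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER);
    CREATE TABLE orders (customer_id INTEGER, status TEXT);
    INSERT INTO customers VALUES (1), (2), (3);
    -- One shipped order has a NULL customer_id
    INSERT INTO orders VALUES (1, 'shipped'), (NULL, 'shipped');
""")

not_in = conn.execute("""
    SELECT customer_id FROM customers
    WHERE customer_id NOT IN
          (SELECT customer_id FROM orders WHERE status = 'shipped')
""").fetchall()

not_exists = conn.execute("""
    SELECT customer_id FROM customers c
    WHERE NOT EXISTS (SELECT 1 FROM orders o
                      WHERE o.customer_id = c.customer_id
                        AND o.status = 'shipped')
""").fetchall()

print(not_in)      # [] -- the NULL makes every NOT IN comparison UNKNOWN
print(not_exists)  # [(2,), (3,)] -- correct anti-join semantics
```

The NOT IN version returns zero rows because `2 NOT IN (1, NULL)` evaluates to UNKNOWN, not TRUE; this is standard three-valued logic, not an engine quirk.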
Predicates (WHERE and JOIN conditions) often contain hidden performance problems. Small changes to how conditions are expressed can dramatically change execution plans.
SARGable Predicates
SARG = Search ARGument. A SARGable predicate allows the optimizer to use indexes. Non-SARGable predicates force table scans.
```sql
-- ================================================================
-- Predicate Optimization Patterns
-- ================================================================

-- =========================
-- Pattern 1: Remove Functions from Indexed Columns
-- =========================

-- NON-SARGABLE (cannot use index on created_at)
SELECT * FROM orders
WHERE YEAR(created_at) = 2024
  AND MONTH(created_at) = 6;

-- SARGABLE (uses index on created_at)
SELECT * FROM orders
WHERE created_at >= '2024-06-01'
  AND created_at < '2024-07-01';

/*
Why the difference?
- YEAR(created_at) = 2024: Must evaluate YEAR() for every row
- created_at >= '2024-06-01': Can seek directly in index

The optimizer can use an index range scan for direct comparisons
but cannot "see through" function calls to use indexes.
*/

-- NON-SARGABLE: Function on column
SELECT * FROM customers WHERE LOWER(email) = 'john@example.com';

-- SARGABLE: Compare the raw column (if collation is case-insensitive,
-- or values are normalized on write)
SELECT * FROM customers WHERE email = 'john@example.com';

-- Or create a functional index (PostgreSQL, MySQL 8+)
CREATE INDEX idx_customers_email_lower ON customers(LOWER(email));

-- =========================
-- Pattern 2: Avoid Implicit Type Conversions
-- =========================

-- PROBLEM: customer_id is INT, but we pass string
-- NON-SARGABLE (implicit conversion on every row)
SELECT * FROM orders WHERE customer_id = '12345';

-- SARGABLE (correct type, no conversion)
SELECT * FROM orders WHERE customer_id = 12345;

-- PROBLEM: Mixed types in JOIN
-- This may convert every row of one table
SELECT *
FROM orders o
JOIN order_legacy l ON o.order_id = CAST(l.order_code AS INT);

-- Better: Fix the source data or add a computed column
ALTER TABLE order_legacy ADD order_id_int AS CAST(order_code AS INT) PERSISTED;
CREATE INDEX idx_order_legacy_id ON order_legacy(order_id_int);

-- Then join on matching types
SELECT *
FROM orders o
JOIN order_legacy l ON o.order_id = l.order_id_int;

-- =========================
-- Pattern 3: Transform OR to UNION
-- =========================

-- PROBLEM: OR on different columns prevents single index use
-- This often results in a table scan
SELECT * FROM products
WHERE category_id = 5 OR supplier_id = 12;

-- SOLUTION: UNION allows index use on each condition
SELECT * FROM products WHERE category_id = 5
UNION
SELECT * FROM products WHERE supplier_id = 12;

-- Or UNION ALL with deduplication if overlap is rare
SELECT DISTINCT * FROM (
    SELECT * FROM products WHERE category_id = 5
    UNION ALL
    SELECT * FROM products WHERE supplier_id = 12
) combined;

/*
Each UNION branch can use its optimal index:
- First branch uses idx_products_category
- Second branch uses idx_products_supplier
Combined result is still smaller than a full table scan.
*/

-- =========================
-- Pattern 4: Rewrite LIKE for Index Usage
-- =========================

-- NON-SARGABLE: Leading wildcard (cannot use index)
SELECT * FROM products WHERE product_name LIKE '%widget%';

-- SARGABLE: Trailing wildcard only (uses index)
SELECT * FROM products WHERE product_name LIKE 'widget%';

-- For middle-of-string search, consider:
-- 1. Full-text search indexes
-- 2. Trigram indexes (PostgreSQL pg_trgm)
-- 3. Application-level search (Elasticsearch, etc.)

-- PostgreSQL: Trigram index for LIKE '%pattern%'
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX idx_products_name_trgm
ON products USING gin(product_name gin_trgm_ops);

-- Now this can use the trigram index:
SELECT * FROM products WHERE product_name LIKE '%widget%';

-- =========================
-- Pattern 5: Transform NOT IN to Anti-Join
-- =========================

-- SLOW: NOT IN with large subquery
SELECT p.* FROM products p
WHERE p.category_id NOT IN (
    SELECT c.category_id FROM categories c WHERE c.active = 0
);

-- BETTER: LEFT JOIN with NULL check
SELECT p.* FROM products p
LEFT JOIN categories c
    ON p.category_id = c.category_id AND c.active = 0
WHERE c.category_id IS NULL;

-- BEST: NOT EXISTS (clearest intent, usually fastest)
SELECT p.* FROM products p
WHERE NOT EXISTS (
    SELECT 1 FROM categories c
    WHERE c.category_id = p.category_id AND c.active = 0
);
```

Type conversions often happen silently. When comparing VARCHAR to NVARCHAR, INT to BIGINT, or CHAR to VARCHAR, the optimizer may convert one side, preventing index usage. Check data types in table definitions and ensure predicates use matching types.
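SARGability is visible directly in execution plans. A small sqlite3 sketch (table and index names are illustrative, and `strftime` stands in for `YEAR()`): EXPLAIN QUERY PLAN reports SCAN when a function wraps the indexed column and SEARCH with the index when the predicate is a plain range.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER, created_at TEXT);
    CREATE INDEX idx_orders_created ON orders(created_at);
""")

def plan(sql):
    # EXPLAIN QUERY PLAN rows carry the plan description in column 3
    return " ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

# Non-SARGable: function wrapped around the indexed column
p1 = plan("SELECT * FROM orders WHERE strftime('%Y', created_at) = '2024'")

# SARGable: half-open range on the raw column
p2 = plan("SELECT * FROM orders WHERE created_at >= '2024-06-01' "
          "AND created_at < '2024-07-01'")

print(p1)  # contains SCAN (full table scan)
print(p2)  # contains SEARCH ... idx_orders_created (index range seek)
```

The same experiment works with `EXPLAIN` in PostgreSQL or MySQL; the wording differs (Seq Scan vs. Index Scan, type=ALL vs. type=range) but the distinction is identical.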
Joins are often the largest contributors to query cost. Rewriting joins can eliminate unnecessary work, reduce intermediate result sizes, and enable more efficient join algorithms.
```sql
-- ================================================================
-- Join Optimization Patterns
-- ================================================================

-- =========================
-- Pattern 1: Filter Early, Join Late
-- =========================

-- SLOW: Join all rows, then filter
SELECT c.customer_name, o.order_id, o.total
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
WHERE o.order_date >= '2024-01-01'
  AND o.status = 'completed'
  AND c.country = 'USA';

/*
Order of operations matters. If the optimizer doesn't push
predicates down, this may join ALL orders to ALL customers
first, then filter. Sometimes we need to make filtering explicit.
*/

-- BETTER: Subqueries materialize filtered data first
SELECT c.customer_name, o.order_id, o.total
FROM (
    SELECT customer_id, customer_name
    FROM customers
    WHERE country = 'USA'
) c
JOIN (
    SELECT customer_id, order_id, total
    FROM orders
    WHERE order_date >= '2024-01-01'
      AND status = 'completed'
) o ON c.customer_id = o.customer_id;

/*
This explicitly filters before joining.
Modern optimizers often do this automatically,
but explicit subqueries can help with complex queries.
Check EXPLAIN to verify predicate pushdown is happening.
*/

-- =========================
-- Pattern 2: Reduce Join Width
-- =========================

-- SLOW: Selecting many columns from large table
SELECT
    o.*,   -- All order columns
    c.*,   -- All customer columns
    oi.*   -- All order_item columns
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
JOIN order_items oi ON o.order_id = oi.order_id
WHERE o.order_date >= '2024-01-01';

-- BETTER: Select only needed columns
SELECT
    o.order_id, o.order_date, o.total,
    c.customer_name, c.email,
    oi.product_id, oi.quantity, oi.unit_price
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
JOIN order_items oi ON o.order_id = oi.order_id
WHERE o.order_date >= '2024-01-01';

/*
Narrower rows = more rows fit in memory = fewer buffer operations.
Also enables covering indexes that include only needed columns.
*/

-- =========================
-- Pattern 3: Eliminate Unnecessary Joins
-- =========================

-- BEFORE: Join to get single column value
SELECT
    o.order_id,
    o.order_date,
    c.customer_id  -- Only using the key, which we already have!
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id;

-- AFTER: No join needed
SELECT
    order_id,
    order_date,
    customer_id  -- Already in orders table
FROM orders;

-- BEFORE: Join for existence check (returns duplicates)
SELECT DISTINCT o.order_id, o.total
FROM orders o
JOIN order_items oi ON o.order_id = oi.order_id
WHERE oi.product_id = 100;

-- AFTER: EXISTS (no duplicate handling needed)
SELECT o.order_id, o.total
FROM orders o
WHERE EXISTS (
    SELECT 1 FROM order_items oi
    WHERE oi.order_id = o.order_id
      AND oi.product_id = 100
);

-- =========================
-- Pattern 4: Replace Multiple Joins with UNION
-- =========================

-- SLOW: Multiple LEFT JOINs to check multiple conditions
SELECT
    c.customer_id,
    CASE
        WHEN e.email_id IS NOT NULL THEN 'Email'
        WHEN p.phone_id IS NOT NULL THEN 'Phone'
        WHEN a.address_id IS NOT NULL THEN 'Mail'
        ELSE 'Unknown'
    END AS contact_method
FROM customers c
LEFT JOIN email_contacts e ON c.customer_id = e.customer_id
LEFT JOIN phone_contacts p ON c.customer_id = p.customer_id
LEFT JOIN address_contacts a ON c.customer_id = a.customer_id;

-- BETTER: UNION approach (for large tables)
SELECT customer_id, 'Email' AS contact_method
FROM customers
WHERE customer_id IN (SELECT customer_id FROM email_contacts)
UNION ALL
SELECT customer_id, 'Phone' AS contact_method
FROM customers
WHERE customer_id IN (SELECT customer_id FROM phone_contacts)
  AND customer_id NOT IN (SELECT customer_id FROM email_contacts)
UNION ALL
-- ... etc.

/*
The optimal approach depends on data distribution.
- If most customers have email: LEFT JOIN may be fine
- If contacts are sparse: EXISTS checks may be faster
- Profile both approaches with real data!
*/

-- =========================
-- Pattern 5: Rewrite Theta Joins
-- =========================

-- SLOW: Range join (non-equi join)
SELECT e.employee_name, s.salary_grade
FROM employees e
JOIN salary_grades s
    ON e.salary >= s.min_salary
   AND e.salary <= s.max_salary;

/*
Range joins can't use hash or merge joins efficiently.
They often result in nested loops with full scan of one side.
*/

-- OPTIMIZATION OPTIONS:
-- 1. Pre-compute join result in application
-- 2. Use bucketing/binning to convert to equi-join
-- 3. Add computed column for grade lookup

-- Example: Add precomputed salary grade to employees
ALTER TABLE employees ADD salary_grade_id INT;

-- Update using the range logic (one-time or batch)
UPDATE employees e
SET salary_grade_id = (
    SELECT s.grade_id FROM salary_grades s
    WHERE e.salary >= s.min_salary
      AND e.salary <= s.max_salary
);

-- Now queries can use efficient equi-join
SELECT e.employee_name, s.salary_grade
FROM employees e
JOIN salary_grades s ON e.salary_grade_id = s.grade_id;
```

The optimizer chooses join order, but it has limits on how many orderings it evaluates. For complex queries with many joins, try reordering tables in the FROM clause to suggest a good starting point. Some databases respect hints like STRAIGHT_JOIN (MySQL) or explicit join order.
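Pattern 3's claim that a DISTINCT join and an EXISTS are interchangeable for existence checks can be spot-checked directly. A sqlite3 sketch with toy data, including an order that contains the target product twice so the bare join would duplicate it:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER, total REAL);
    CREATE TABLE order_items (order_id INTEGER, product_id INTEGER);
    INSERT INTO orders VALUES (1, 50.0), (2, 75.0), (3, 20.0);
    -- order 1 contains product 100 twice: a plain join would duplicate it
    INSERT INTO order_items VALUES (1, 100), (1, 100), (2, 200), (3, 100);
""")

join_distinct = conn.execute("""
    SELECT DISTINCT o.order_id, o.total
    FROM orders o
    JOIN order_items oi ON o.order_id = oi.order_id
    WHERE oi.product_id = 100
    ORDER BY o.order_id
""").fetchall()

exists = conn.execute("""
    SELECT o.order_id, o.total
    FROM orders o
    WHERE EXISTS (SELECT 1 FROM order_items oi
                  WHERE oi.order_id = o.order_id AND oi.product_id = 100)
    ORDER BY o.order_id
""").fetchall()

print(join_distinct)  # [(1, 50.0), (3, 20.0)]
assert join_distinct == exists  # same rows, but EXISTS never built duplicates
```

The results match, but the EXISTS form never produced the duplicate row in the first place, which is exactly the work the DISTINCT in the join version exists to undo.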
Aggregation queries—GROUP BY, DISTINCT, and aggregate functions—can be expensive because they require sorting or hashing entire datasets. Strategic rewrites can dramatically reduce the data volume processed.
```sql
-- ================================================================
-- Aggregation Optimization Patterns
-- ================================================================

-- =========================
-- Pattern 1: Filter Before Aggregating
-- =========================

-- SLOW: Aggregate all, then filter with HAVING
SELECT product_id, SUM(quantity) AS total_quantity
FROM order_items
GROUP BY product_id
HAVING SUM(quantity) > 1000;

/*
This aggregates ALL products, then discards most results.
If 95% of products have quantity < 1000, wasted work.
*/

-- Better approach when possible: Pre-filter if column exists
SELECT product_id, SUM(quantity) AS total_quantity
FROM order_items oi
WHERE EXISTS (
    -- Only include products that COULD exceed threshold
    SELECT 1 FROM products p
    WHERE p.product_id = oi.product_id
      AND p.is_popular = 1  -- Some pre-filter criterion
)
GROUP BY product_id
HAVING SUM(quantity) > 1000;

-- Or maintain running totals separately for large datasets
-- (denormalization for read performance)

-- =========================
-- Pattern 2: Use DISTINCT vs GROUP BY Appropriately
-- =========================

-- These are semantically equivalent:
SELECT DISTINCT customer_id FROM orders;
SELECT customer_id FROM orders GROUP BY customer_id;

/*
Performance may differ:
- DISTINCT: Optimized for exact duplicates
- GROUP BY: Optimized when aggregates are needed

Generally, use DISTINCT for simple deduplication,
GROUP BY when calculating aggregates.
Check EXPLAIN to see if optimizer chooses different strategies.
*/

-- =========================
-- Pattern 3: Approximate Aggregation for Analytics
-- =========================

-- EXACT (potentially expensive for billions of rows)
SELECT COUNT(DISTINCT user_id) AS unique_users
FROM page_views
WHERE event_date >= '2024-01-01';

-- APPROXIMATE (much faster for analytics use cases)
-- SQL Server 2019+: APPROX_COUNT_DISTINCT
SELECT APPROX_COUNT_DISTINCT(user_id) AS approx_unique_users
FROM page_views
WHERE event_date >= '2024-01-01';

-- PostgreSQL offers HyperLogLog-based counting via extensions
-- (e.g. postgresql-hll); BigQuery and Spark SQL provide
-- APPROX_COUNT_DISTINCT / approx_count_distinct natively.

/*
Approximate functions trade 1-2% accuracy for 10-100x speed.
Excellent for dashboards, analytics, monitoring.
Not appropriate for financial reporting or exact counts.
*/

-- =========================
-- Pattern 4: Partial Aggregation with Pre-aggregated Tables
-- =========================

-- SLOW: Aggregate from transaction table every time
SELECT
    DATE_TRUNC('month', order_date) AS month,
    COUNT(*) AS order_count,
    SUM(total) AS revenue
FROM orders
WHERE order_date >= '2020-01-01'
GROUP BY DATE_TRUNC('month', order_date);

-- FAST: Maintain pre-aggregated summary table
-- Daily aggregation table (maintained by batch job or trigger)
CREATE TABLE order_daily_summary (
    summary_date DATE PRIMARY KEY,
    order_count INT,
    total_revenue DECIMAL(18,2)
);

-- Query from summary instead
SELECT
    DATE_TRUNC('month', summary_date) AS month,
    SUM(order_count) AS order_count,
    SUM(total_revenue) AS revenue
FROM order_daily_summary
WHERE summary_date >= '2020-01-01'
GROUP BY DATE_TRUNC('month', summary_date);

/*
Pre-aggregation trades storage and update complexity
for dramatically faster reads.

Standard practice for:
- Analytics dashboards
- Reporting queries
- Time-series data

Consider update strategy: batch, incremental, or trigger-based.
*/

-- =========================
-- Pattern 5: Window Functions vs JOIN for Rankings
-- =========================

-- SLOW: Self-join for "top per group"
SELECT o.*
FROM orders o
WHERE o.total = (
    SELECT MAX(o2.total)
    FROM orders o2
    WHERE o2.customer_id = o.customer_id
);

/*
Correlated subquery executes once per row.
For 1M orders, this is slow.
*/

-- FAST: Window function (single pass)
SELECT *
FROM (
    SELECT
        o.*,
        ROW_NUMBER() OVER (
            PARTITION BY customer_id
            ORDER BY total DESC
        ) AS rn
    FROM orders o
) ranked
WHERE rn = 1;

/*
Window function processes all rows in a single pass,
partitioning and ranking efficiently.
Modern approach for any "top N per group" problem.
*/
```

COUNT(*) counts all rows; COUNT(column) counts non-NULL values. COUNT(*) can be optimized using statistics or narrow indexes. COUNT(column) may require checking each value. Use COUNT(*) for row counts unless you specifically need to exclude NULLs.
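The COUNT(*) vs. COUNT(column) distinction is a one-liner to verify. A sqlite3 sketch with a nullable `ship_date` column (names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, ship_date TEXT)")
# Two of four orders have not shipped yet (NULL ship_date)
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(1, '2024-01-02'), (2, None), (3, '2024-01-05'), (4, None)])

count_star, count_col = conn.execute(
    "SELECT COUNT(*), COUNT(ship_date) FROM orders").fetchone()

print(count_star)  # 4 -- every row
print(count_col)   # 2 -- only rows where ship_date IS NOT NULL
```

Here COUNT(ship_date) is the right tool only if the question is "how many orders have shipped?"; for "how many orders are there?" it silently undercounts.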
CTEs (Common Table Expressions) and derived tables (subqueries in FROM) provide structure and readability. However, their performance characteristics differ across databases—understanding these differences is essential for optimization.
```sql
-- ================================================================
-- CTE and Derived Table Optimization
-- ================================================================

-- =========================
-- Understanding CTE Execution Models
-- =========================

/*
Database systems handle CTEs differently:

PostgreSQL (pre-12): CTEs were ALWAYS materialized
  - Good: Prevents repeated execution
  - Bad: Prevents predicate pushdown

PostgreSQL (12+): CTEs are inlined unless:
  - Referenced multiple times
  - Marked with MATERIALIZED hint

SQL Server: CTEs are typically inlined
  - Optimizer treats as views
  - May execute multiple times if referenced multiple times

MySQL (8+): Similar to SQL Server
  - Inlining is default

Oracle: Can be materialized with hints
  - MATERIALIZE or INLINE hints available
*/

-- =========================
-- When to Force Materialization
-- =========================

-- SCENARIO: CTE used multiple times
WITH expensive_calculation AS (
    SELECT customer_id, SUM(total) AS lifetime_value
    FROM orders
    WHERE order_date >= '2020-01-01'
    GROUP BY customer_id
)
SELECT * FROM expensive_calculation WHERE lifetime_value > 10000
UNION ALL
SELECT * FROM expensive_calculation WHERE lifetime_value BETWEEN 5000 AND 10000
UNION ALL
SELECT * FROM expensive_calculation WHERE lifetime_value BETWEEN 1000 AND 5000;

/*
Without materialization, the CTE may execute THREE times.
Force materialization when a CTE is referenced multiple times.
*/

-- PostgreSQL 12+: Force materialization
WITH expensive_calculation AS MATERIALIZED (
    SELECT customer_id, SUM(total) AS lifetime_value
    FROM orders
    WHERE order_date >= '2020-01-01'
    GROUP BY customer_id
)
SELECT * FROM expensive_calculation WHERE lifetime_value > 10000
UNION ALL
SELECT * FROM expensive_calculation WHERE lifetime_value BETWEEN 5000 AND 10000;

-- SQL Server: Use temp table for explicit materialization
SELECT customer_id, SUM(total) AS lifetime_value
INTO #expensive_calc
FROM orders
WHERE order_date >= '2020-01-01'
GROUP BY customer_id;

CREATE INDEX idx_temp ON #expensive_calc(lifetime_value);

SELECT * FROM #expensive_calc WHERE lifetime_value > 10000
UNION ALL
SELECT * FROM #expensive_calc WHERE lifetime_value BETWEEN 5000 AND 10000;

DROP TABLE #expensive_calc;

-- =========================
-- When to Force Inlining
-- =========================

-- SCENARIO: CTE prevents predicate pushdown
WITH recent_orders AS (
    SELECT * FROM orders  -- All orders selected
)
SELECT * FROM recent_orders WHERE customer_id = 12345;

/*
In PostgreSQL < 12, this materialized ALL orders,
then filtered. Very inefficient.

PostgreSQL 12+ inlines by default, allowing pushdown.
*/

-- PostgreSQL 12+: Explicit inline (unnecessary but available)
WITH recent_orders AS NOT MATERIALIZED (
    SELECT * FROM orders
)
SELECT * FROM recent_orders WHERE customer_id = 12345;

-- Best practice: Write the predicate in the CTE
WITH recent_orders AS (
    SELECT * FROM orders
    WHERE customer_id = 12345  -- Filter pushed into CTE
)
SELECT * FROM recent_orders;

-- =========================
-- Recursive CTE Optimization
-- =========================

-- Recursive CTEs can be expensive; optimize the anchor and recursive terms

-- SLOW: Unoptimized recursive CTE
WITH RECURSIVE org_tree AS (
    -- Anchor: top-level employees (the recursion then visits everyone!)
    SELECT employee_id, manager_id, name, 1 AS level
    FROM employees
    WHERE manager_id IS NULL

    UNION ALL

    -- Recursive: Join to find reports
    SELECT e.employee_id, e.manager_id, e.name, ot.level + 1
    FROM employees e
    JOIN org_tree ot ON e.manager_id = ot.employee_id
)
SELECT * FROM org_tree WHERE employee_id = 12345;

/*
This builds the ENTIRE org tree, then filters.
If you only need one employee's path, start from there instead.
*/

-- FAST: Start from target, walk up
WITH RECURSIVE emp_path AS (
    -- Anchor: Start from specific employee
    SELECT employee_id, manager_id, name, 1 AS level
    FROM employees
    WHERE employee_id = 12345

    UNION ALL

    -- Recursive: Walk up to manager
    SELECT e.employee_id, e.manager_id, e.name, ep.level + 1
    FROM employees e
    JOIN emp_path ep ON e.employee_id = ep.manager_id
)
SELECT * FROM emp_path;

/*
Processes only the path from employee to CEO,
not the entire organization.
*/
```

Don't assume CTE behavior—verify with EXPLAIN. Look for "CTE Scan" (materialized) vs. inline evaluation. The optimizer's choice may differ from your expectation, especially after database upgrades.
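The walk-up pattern runs on any engine with recursive CTE support, including sqlite3. A sketch with a four-level toy hierarchy (names and IDs are illustrative); only the target employee's chain of command is visited, not the sibling branch:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (employee_id INTEGER, manager_id INTEGER, name TEXT);
    INSERT INTO employees VALUES
        (1, NULL, 'CEO'), (2, 1, 'VP'), (3, 2, 'Manager'), (4, 3, 'IC'),
        (5, 1, 'Other VP');  -- sibling branch, not on employee 4's path
""")

path = conn.execute("""
    WITH RECURSIVE emp_path AS (
        -- Anchor: start from the target employee
        SELECT employee_id, manager_id, name, 1 AS level
        FROM employees WHERE employee_id = 4
        UNION ALL
        -- Recursive: walk up to each manager in turn
        SELECT e.employee_id, e.manager_id, e.name, ep.level + 1
        FROM employees e
        JOIN emp_path ep ON e.employee_id = ep.manager_id
    )
    SELECT name FROM emp_path ORDER BY level
""").fetchall()

print(path)  # [('IC',), ('Manager',), ('VP',), ('CEO',)]
```

The recursion terminates naturally when it reaches the CEO, whose NULL manager_id matches no further rows; 'Other VP' is never touched.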
Query rewrites must be verified for both correctness and performance improvement. A systematic testing approach prevents introducing bugs or regressions.
```sql
-- ================================================================
-- Query Rewrite Verification Techniques
-- ================================================================

-- =========================
-- Technique 1: Row Count Comparison
-- =========================

-- Count from original
SELECT COUNT(*) AS original_count FROM (
    -- Original query here
    SELECT
        c.customer_id,
        c.customer_name,
        (SELECT MAX(o.order_date) FROM orders o
         WHERE o.customer_id = c.customer_id) AS last_order
    FROM customers c
) original;

-- Count from rewritten
SELECT COUNT(*) AS rewritten_count FROM (
    -- Rewritten query here
    SELECT
        c.customer_id,
        c.customer_name,
        MAX(o.order_date) AS last_order
    FROM customers c
    LEFT JOIN orders o ON c.customer_id = o.customer_id
    GROUP BY c.customer_id, c.customer_name
) rewritten;

-- Should be equal!

-- =========================
-- Technique 2: EXCEPT Comparison (Bi-directional)
-- =========================

-- Rows in original but not in rewritten (should be empty)
(
    SELECT
        c.customer_id,
        c.customer_name,
        (SELECT MAX(o.order_date) FROM orders o
         WHERE o.customer_id = c.customer_id) AS last_order
    FROM customers c
)
EXCEPT
(
    SELECT
        c.customer_id,
        c.customer_name,
        MAX(o.order_date) AS last_order
    FROM customers c
    LEFT JOIN orders o ON c.customer_id = o.customer_id
    GROUP BY c.customer_id, c.customer_name
);

-- Rows in rewritten but not in original (should be empty)
(
    SELECT
        c.customer_id,
        c.customer_name,
        MAX(o.order_date) AS last_order
    FROM customers c
    LEFT JOIN orders o ON c.customer_id = o.customer_id
    GROUP BY c.customer_id, c.customer_name
)
EXCEPT
(
    SELECT
        c.customer_id,
        c.customer_name,
        (SELECT MAX(o.order_date) FROM orders o
         WHERE o.customer_id = c.customer_id) AS last_order
    FROM customers c
);

-- Both should return 0 rows!

-- =========================
-- Technique 3: Performance Comparison
-- =========================

-- Original query timing
EXPLAIN (ANALYZE, BUFFERS, TIMING)
SELECT
    c.customer_id,
    c.customer_name,
    (SELECT MAX(o.order_date) FROM orders o
     WHERE o.customer_id = c.customer_id) AS last_order
FROM customers c;

/*
Capture:
- Execution time
- Buffer hits/reads
- Rows examined
*/

-- Rewritten query timing
EXPLAIN (ANALYZE, BUFFERS, TIMING)
SELECT
    c.customer_id,
    c.customer_name,
    MAX(o.order_date) AS last_order
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_id, c.customer_name;

/*
Compare metrics:
- Is execution time lower?
- Are buffer reads reduced?
- Is CPU time reduced?

Only deploy if clearly better across representative cases.
*/

-- =========================
-- Technique 4: Parameter Sensitivity Testing
-- =========================

-- Test with different parameters that produce different data volumes

-- Small result set parameter
EXPLAIN ANALYZE SELECT ... WHERE customer_id = 1;

-- Large result set parameter
EXPLAIN ANALYZE SELECT ... WHERE customer_id BETWEEN 1 AND 10000;

-- NULL parameter (if applicable)
EXPLAIN ANALYZE SELECT ... WHERE customer_id IS NULL;

-- Verify optimal plan chosen across parameter variations

-- =========================
-- Technique 5: NULL Behavior Verification
-- =========================

-- Insert test data with NULLs if not present
-- Then verify both queries handle identically

-- Check NULL count matches
SELECT
    SUM(CASE WHEN last_order IS NULL THEN 1 ELSE 0 END) AS null_count,
    SUM(CASE WHEN last_order IS NOT NULL THEN 1 ELSE 0 END) AS non_null_count
FROM (
    -- Original query
) original;

-- Compare to rewritten
SELECT
    SUM(CASE WHEN last_order IS NULL THEN 1 ELSE 0 END) AS null_count,
    SUM(CASE WHEN last_order IS NOT NULL THEN 1 ELSE 0 END) AS non_null_count
FROM (
    -- Rewritten query
) rewritten;
```

Query rewriting transforms structurally inefficient SQL into efficient alternatives.
Unlike index tuning which helps the optimizer execute your query better, rewriting gives the optimizer fundamentally better queries to work with.
What's next:
Query rewriting addresses individual queries. The final page covers continuous improvement—establishing processes and practices that maintain and improve query performance over time, across development, deployment, and production operations.
You now possess a comprehensive toolkit for query rewriting—subquery transformations, predicate optimization, join restructuring, and aggregation techniques. Combined with rigorous verification practices, you can confidently transform slow queries into efficient ones.