Correct SQL that runs slowly is often worthless in production. Query optimization transforms queries that take minutes into queries that complete in milliseconds—the difference between a responsive application and a frustrated user base.
In technical interviews, demonstrating optimization awareness elevates you from 'someone who knows SQL' to 'someone who understands databases.' When asked to write a query, proactively discussing its performance characteristics and potential optimizations shows the depth of understanding that senior engineers possess.
By the end of this page, you will be able to:
- Understand how databases execute queries and read execution plans
- Write queries that let the optimizer work efficiently
- Apply indexing strategies that dramatically improve performance
- Avoid common anti-patterns that kill query speed
- Discuss optimization trade-offs intelligently in interviews
Before optimizing queries, you must understand how databases execute them. Every SQL statement passes through a pipeline that transforms your declarative request into physical operations.
The Query Processing Pipeline:
| Stage | Purpose | Output | Optimization Opportunity |
|---|---|---|---|
| Parsing | Validate SQL syntax | Parse tree | None (syntax is syntax) |
| Binding | Resolve table/column names | Bound tree | None (schema-defined) |
| Optimization | Find efficient execution plan | Query plan | Statistics, hints, schema design |
| Execution | Run the chosen plan | Result set | Indexes, memory, parallelism |
The Query Optimizer:
The optimizer's job is to find an efficient execution plan among potentially millions of alternatives. It considers factors such as table and index sizes, column value distributions, available indexes, and possible join orders and access methods.
Cost-Based Optimization:
Modern optimizers are cost-based—they estimate the 'cost' of each possible plan and choose the lowest-cost option. Cost is typically measured in estimated I/O and CPU operations. The optimizer relies on statistics about table sizes, column value distributions, and index characteristics.
Stale statistics cause the optimizer to make poor decisions. After significant data changes (bulk loads, mass deletes), update statistics. In PostgreSQL: ANALYZE table_name. In MySQL: ANALYZE TABLE table_name. In SQL Server: UPDATE STATISTICS table_name.
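For example, a minimal PostgreSQL sketch (using the orders table referenced throughout this page) for refreshing and inspecting statistics after a bulk load:

```sql
-- Refresh planner statistics after a bulk load
ANALYZE orders;

-- Check when statistics were last gathered (manually or by autovacuum)
SELECT relname, last_analyze, last_autoanalyze
FROM pg_stat_user_tables
WHERE relname = 'orders';

-- Optionally raise the sampling target for a heavily skewed column, then re-analyze
ALTER TABLE orders ALTER COLUMN customer_id SET STATISTICS 500;
ANALYZE orders;
```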
Execution plans reveal exactly how the database will execute your query. Learning to read them is essential for optimization. Every major database provides EXPLAIN functionality.
```sql
-- PostgreSQL: EXPLAIN with ANALYZE for actual execution stats
EXPLAIN ANALYZE
SELECT c.name, o.order_id, o.total
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
WHERE o.order_date >= '2024-01-01'
ORDER BY o.total DESC
LIMIT 10;

-- PostgreSQL: EXPLAIN with all options
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT * FROM orders WHERE customer_id = 1001;

-- MySQL: EXPLAIN with FORMAT
EXPLAIN FORMAT=JSON
SELECT * FROM orders WHERE customer_id = 1001;

-- SQL Server: Actual Execution Plan
SET STATISTICS IO ON;
SET STATISTICS TIME ON;
SELECT * FROM orders WHERE customer_id = 1001;
```

Key Plan Elements to Examine:
| Component | What It Tells You | Warning Signs |
|---|---|---|
| Seq Scan / Table Scan | Reading entire table | Large tables without filter |
| Index Scan | Reading via index | OK for small result sets |
| Index Only Scan | All data from index | Optimal—no table access |
| Nested Loop | Row-by-row join | Inner side lacks index |
| Hash Join | Build hash, probe | Insufficient memory |
| Sort | Ordering operation | Large data without index |
| Rows (estimated vs actual) | Cardinality estimation | Large discrepancy = stale stats |
```sql
/* Example PostgreSQL EXPLAIN ANALYZE output:

Limit  (cost=1000.43..1000.46 rows=10 width=52) (actual time=15.234..15.238 rows=10 loops=1)
  ->  Sort  (cost=1000.43..1025.43 rows=10000 width=52) (actual time=15.232..15.234 rows=10 loops=1)
        Sort Key: o.total DESC
        Sort Method: top-N heapsort  Memory: 26kB
        ->  Hash Join  (cost=35.00..766.52 rows=10000 width=52) (actual time=0.435..12.123 rows=10000 loops=1)
              Hash Cond: (o.customer_id = c.customer_id)
              ->  Seq Scan on orders o  (cost=0.00..693.00 rows=10000 width=24) (actual time=0.015..5.234 rows=10000 loops=1)
                    Filter: (order_date >= '2024-01-01'::date)
                    Rows Removed by Filter: 5000
              ->  Hash  (cost=22.00..22.00 rows=1000 width=36) (actual time=0.389..0.390 rows=1000 loops=1)
                    Buckets: 1024  Batches: 1  Memory Usage: 65kB
                    ->  Seq Scan on customers c  (cost=0.00..22.00 rows=1000 width=36) (actual time=0.008..0.189 rows=1000 loops=1)
Planning Time: 0.234 ms
Execution Time: 15.312 ms

KEY OBSERVATIONS:
1. Seq Scan on orders with filter - 5000 rows removed
   -> Consider an index on (order_date) if this filter is common
2. Hash Join chosen - appropriate for this data size
   -> Hash built on the smaller table (customers) - correct
3. Top-N heapsort used for LIMIT + ORDER BY
   -> Efficient for small N, doesn't sort everything
4. Actual rows = estimated rows - good statistics
*/

-- After adding an index, re-run EXPLAIN
CREATE INDEX idx_orders_date ON orders(order_date);
EXPLAIN ANALYZE SELECT * FROM orders WHERE order_date >= '2024-01-01';

/* With the index:
Index Scan using idx_orders_date on orders  (cost=0.29..345.23 rows=10000 width=24)
  (actual time=0.015..2.145 rows=10000 loops=1)
  Index Cond: (order_date >= '2024-01-01'::date)

MUCH FASTER: Uses an index scan instead of seq scan + filter
*/
```

When writing a query in an interview, mention what the execution plan might look like: 'This join would benefit from an index on customer_id. Without it, we'd see nested loops scanning the full orders table for each customer.' This demonstrates practical optimization knowledge.
The single most impactful optimization is ensuring your queries can use appropriate indexes. Understanding how indexes work enables you to write queries that leverage them effectively.
Index Types and Use Cases:
| Index Type | Best For | Limitation | Example |
|---|---|---|---|
| B-Tree (default) | Equality, range, sorting | Not for leading-wildcard patterns | WHERE id = 5, ORDER BY date |
| Hash | Equality only | No range queries | WHERE status = 'active' |
| GIN/Full-Text | Text search, arrays | Write overhead | WHERE doc @@ 'database' |
| GiST | Geometric, range types | Specialized | WHERE point <-> origin < 100 |
| Partial | Filtered subset | Limited applicability | WHERE status = 'pending' |
| Covering | All columns in query | Storage overhead | Includes all SELECT columns |
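The code below focuses on B-tree behavior, the most common case. For the specialized types in the table, here is a brief, hedged sketch of typical PostgreSQL syntax (the documents table and its body column are hypothetical):

```sql
-- Hash index: equality lookups only
CREATE INDEX idx_orders_status_hash ON orders USING HASH (status);

-- GIN full-text index over a computed tsvector
-- (documents/body are hypothetical names used for illustration)
CREATE INDEX idx_docs_fts ON documents
  USING GIN (to_tsvector('english', body));

-- A query that can use the GIN index:
SELECT * FROM documents
WHERE to_tsvector('english', body) @@ to_tsquery('english', 'database');
```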
```sql
-- Index on (customer_id)
-- USES INDEX:
SELECT * FROM orders WHERE customer_id = 1001;
SELECT * FROM orders WHERE customer_id IN (1001, 1002, 1003);
SELECT * FROM orders WHERE customer_id BETWEEN 1000 AND 2000;

-- CANNOT USE INDEX (function on column):
SELECT * FROM orders WHERE YEAR(order_date) = 2024;              -- BAD
SELECT * FROM orders
WHERE order_date >= '2024-01-01' AND order_date < '2025-01-01';  -- GOOD

-- MAY NOT USE INDEX (implicit conversion):
SELECT * FROM orders WHERE customer_id = '1001';  -- May not use the index if types differ

-- Composite index on (customer_id, order_date)
-- USES INDEX (leftmost prefix):
SELECT * FROM orders WHERE customer_id = 1001;
SELECT * FROM orders WHERE customer_id = 1001 AND order_date = '2024-01-15';

-- CANNOT USE INDEX (missing leftmost column):
SELECT * FROM orders WHERE order_date = '2024-01-15';  -- Might skip the index

-- Index for sorts
CREATE INDEX idx_orders_date ON orders(order_date DESC);
-- USES INDEX for:
SELECT * FROM orders ORDER BY order_date DESC LIMIT 10;

-- Covering index (includes all needed columns)
CREATE INDEX idx_orders_covering ON orders(customer_id)
  INCLUDE (order_date, total, status);

-- This query can be satisfied entirely from the index:
SELECT order_date, total, status FROM orders WHERE customer_id = 1001;

-- Partial index (only index the rows you need)
CREATE INDEX idx_pending_orders ON orders(created_at)
  WHERE status = 'pending';

-- Efficient for a common filtered query:
SELECT * FROM orders WHERE status = 'pending' ORDER BY created_at;
```

These patterns prevent index usage: (1) functions on indexed columns (YEAR(date)), (2) implicit type conversions, (3) OR conditions on different columns, (4) LIKE with a leading wildcard ('%text'), (5) NOT IN with nullable subqueries. Rewrite queries to avoid these when possible.
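As an illustration of the last two items, here is a hedged sketch of common rewrites (table and column names follow the schema used on this page; adapt as needed):

```sql
-- LIKE with a leading wildcard cannot use a B-tree index:
SELECT * FROM products WHERE product_name LIKE '%phone%';
-- A prefix pattern can:
SELECT * FROM products WHERE product_name LIKE 'phone%';
-- For true substring search, consider a full-text or trigram index instead.

-- NOT IN returns no rows if the subquery yields any NULL,
-- and it often optimizes poorly; NOT EXISTS avoids both issues:
SELECT c.*
FROM customers c
WHERE NOT EXISTS (
    SELECT 1 FROM orders o WHERE o.customer_id = c.customer_id
);
```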
Certain query patterns consistently cause performance problems. Recognizing and avoiding these anti-patterns is fundamental to writing efficient SQL.
**Anti-Pattern: `SELECT *`**
Retrieving all columns when you only need a few wastes bandwidth and memory, and it prevents covering-index optimization:
```sql
-- ANTI-PATTERN: Fetching all columns
SELECT * FROM orders WHERE customer_id = 1001;
-- Fetches: order_id, customer_id, order_date, total,
--          status, shipping_address, billing_address,
--          notes, created_at, updated_at, ...

-- BETTER: Only the needed columns
SELECT order_id, order_date, total, status
FROM orders WHERE customer_id = 1001;

-- BEST: With a covering index, no table access is needed
CREATE INDEX idx_orders_customer_covering
  ON orders(customer_id)
  INCLUDE (order_id, order_date, total, status);

SELECT order_id, order_date, total, status
FROM orders WHERE customer_id = 1001;
-- Index-only scan: 10x faster
```

Joins are often the most expensive operations in complex queries. Understanding join execution and applying optimization strategies significantly impacts performance.
Join Algorithm Selection:
| Algorithm | When Chosen | Optimization Strategy |
|---|---|---|
| Nested Loop | Small outer table, indexed inner | Index on inner join column |
| Hash Join | Large unsorted tables, equi-join | Sufficient work_mem/memory |
| Sort-Merge | Inputs already sorted, or both inputs very large | Pre-sorted data, cluster indexes |
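You normally let the planner choose the algorithm, but when diagnosing a plan it can help to compare strategies directly. A hedged sketch using PostgreSQL's session-level enable_* settings (diagnostic only, not something to leave on in production):

```sql
-- Temporarily disable nested loops in this transaction, compare the plan, then discard.
BEGIN;
SET LOCAL enable_nestloop = off;
EXPLAIN ANALYZE
SELECT c.name, o.order_id
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id;
ROLLBACK;  -- the SET LOCAL setting ends with the transaction
```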
```sql
-- Essential: Index on foreign keys
CREATE INDEX idx_orders_customer ON orders(customer_id);
CREATE INDEX idx_order_items_order ON order_items(order_id);
CREATE INDEX idx_order_items_product ON order_items(product_id);

-- Filter before joining (reduce the join input)
-- WORSE: Filter after the join
SELECT c.name, o.order_id
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
WHERE o.order_date >= '2024-01-01';

-- The optimizer usually pushes this down, but being explicit helps:
-- EXPLICIT: CTE filter before the join
WITH recent_orders AS (
    SELECT * FROM orders WHERE order_date >= '2024-01-01'
)
SELECT c.name, ro.order_id
FROM customers c
JOIN recent_orders ro ON c.customer_id = ro.customer_id;

-- Avoid joining large intermediate results
-- PROBLEMATIC: Large join, then filter
SELECT COUNT(*)
FROM orders o
JOIN order_items oi ON o.order_id = oi.order_id
JOIN products p ON oi.product_id = p.product_id
WHERE p.category_id = 5;

-- BETTER: Filter early, join less
SELECT COUNT(*)
FROM orders o
JOIN order_items oi ON o.order_id = oi.order_id
WHERE oi.product_id IN (
    SELECT product_id FROM products WHERE category_id = 5
);

-- Or with a CTE for clarity:
WITH category_products AS (
    SELECT product_id FROM products WHERE category_id = 5
)
SELECT COUNT(*)
FROM orders o
JOIN order_items oi ON o.order_id = oi.order_id
JOIN category_products cp ON oi.product_id = cp.product_id;

-- Avoid unnecessary joins
-- UNNECESSARY: Join just to filter
SELECT o.*
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
WHERE c.region = 'North';

-- BETTER: Subquery or semi-join
SELECT * FROM orders
WHERE customer_id IN (
    SELECT customer_id FROM customers WHERE region = 'North'
);

-- Or EXISTS:
SELECT o.*
FROM orders o
WHERE EXISTS (
    SELECT 1 FROM customers c
    WHERE c.customer_id = o.customer_id AND c.region = 'North'
);
```

While modern optimizers usually find good join orders, you can help by: (1) putting the most selective filters first, (2) joining smaller results before larger ones (conceptually), (3) using CTEs to materialize intermediate results when beneficial.
Subqueries and CTEs have different performance characteristics. Understanding when each is preferable helps you write efficient complex queries.
```sql
-- Correlated subquery: can be slow (row-by-row)
SELECT *
FROM products p
WHERE p.unit_price > (
    SELECT AVG(unit_price)
    FROM products
    WHERE category_id = p.category_id   -- Correlation
);

-- REWRITE as a window function (single pass):
SELECT *
FROM (
    SELECT *,
           AVG(unit_price) OVER (PARTITION BY category_id) AS cat_avg
    FROM products
) p
WHERE unit_price > cat_avg;

-- REWRITE as a JOIN (explicit hash join):
SELECT p.*
FROM products p
JOIN (
    SELECT category_id, AVG(unit_price) AS cat_avg
    FROM products
    GROUP BY category_id
) cat_stats ON p.category_id = cat_stats.category_id
WHERE p.unit_price > cat_stats.cat_avg;

-- EXISTS vs IN performance
-- EXISTS short-circuits: stops at the first match
SELECT c.*
FROM customers c
WHERE EXISTS (
    SELECT 1 FROM orders o WHERE o.customer_id = c.customer_id
);

-- IN evaluates the entire subquery
SELECT * FROM customers
WHERE customer_id IN (SELECT customer_id FROM orders);
-- Often optimized to the same plan, but EXISTS is explicit about intent

-- CTE materialization (PostgreSQL 12+)
-- Force materialization (compute once, reuse):
WITH expensive_calc AS MATERIALIZED (
    SELECT customer_id,
           MAX(complex_calculation(data)) AS result  -- placeholder expensive function
    FROM big_table
    GROUP BY customer_id
)
SELECT *
FROM expensive_calc c1
JOIN expensive_calc c2 ON c1.result = c2.result;

-- Inline (evaluate at each reference):
WITH simple_filter AS NOT MATERIALIZED (
    SELECT * FROM orders WHERE status = 'active'
)
SELECT * FROM simple_filter WHERE customer_id = 1001;
-- Allows index usage on the combined predicates

-- Scalar subquery caching
-- Scalar subqueries with the same input are often cached:
SELECT customer_id,
       (SELECT MAX(order_date) FROM orders
        WHERE customer_id = c.customer_id) AS last_order
FROM customers c;
-- But a single-pass window function is usually faster:
SELECT DISTINCT ON (customer_id)
       customer_id,
       MAX(order_date) OVER (PARTITION BY customer_id) AS last_order
FROM orders;
```

Aggregation and sorting are memory-intensive operations. Optimizing them is crucial for queries on large datasets.
```sql
-- Index for ORDER BY
CREATE INDEX idx_orders_date_desc ON orders(order_date DESC);

-- The query uses the index to avoid sorting:
SELECT * FROM orders ORDER BY order_date DESC LIMIT 10;
-- "Index Scan" instead of "Sort + Seq Scan"

-- LIMIT with ORDER BY optimization
-- GOOD: The database can stop after N rows
SELECT * FROM orders ORDER BY order_date DESC LIMIT 10;

-- PROBLEMATIC: Must process all rows before limiting
SELECT * FROM orders ORDER BY random() LIMIT 10;  -- No early termination

-- Approximate distinct counts for large data
-- SLOW: Exact count
SELECT COUNT(DISTINCT customer_id) FROM orders;  -- Processes all rows

-- FAST: Approximate count
-- (APPROX_COUNT_DISTINCT in SQL Server 2019+; PostgreSQL via extensions such as hll)
SELECT APPROX_COUNT_DISTINCT(customer_id) FROM orders;

-- FAST: Sampling for rough estimates
SELECT COUNT(DISTINCT customer_id) * 100 AS estimated_count
FROM orders TABLESAMPLE SYSTEM(1);  -- 1% sample; rough estimate only

-- GROUP BY optimization with an index
CREATE INDEX idx_orders_customer ON orders(customer_id);

-- Can use the index for GROUP BY:
SELECT customer_id, COUNT(*) FROM orders GROUP BY customer_id;

-- Partial aggregation for distributed queries
-- Pre-aggregate in a subquery
SELECT region, SUM(daily_total) AS monthly_total
FROM (
    SELECT region, date, SUM(amount) AS daily_total
    FROM transactions
    GROUP BY region, date
) daily
WHERE date >= '2024-01-01' AND date < '2024-02-01'
GROUP BY region;

-- Avoid DISTINCT when GROUP BY suffices
-- SLOWER:
SELECT DISTINCT customer_id FROM orders;

-- FASTER (often):
SELECT customer_id FROM orders GROUP BY customer_id;

-- HAVING vs WHERE
-- GOOD: Filter rows before grouping (WHERE)
SELECT customer_id, SUM(total)
FROM orders
WHERE order_date >= '2024-01-01'   -- Filters rows first
GROUP BY customer_id
HAVING SUM(total) > 1000;          -- Filters groups after

-- BAD: Filtering in HAVING when WHERE works
SELECT customer_id, SUM(total)
FROM orders
GROUP BY customer_id
HAVING SUM(total) > 1000 AND order_date >= '2024-01-01';  -- Error or wrong
```

In interviews, optimization discussions demonstrate depth. Here's how to approach them effectively:
When to Bring Up Optimization:
```sql
/* Interview Question:
   "Find the top 5 products by revenue for each category"

   First, write a correct solution: */

WITH product_revenue AS (
    SELECT
        p.product_id,
        p.product_name,
        p.category_id,
        SUM(oi.quantity * oi.unit_price) AS revenue
    FROM products p
    JOIN order_items oi ON p.product_id = oi.product_id
    GROUP BY p.product_id, p.product_name, p.category_id
),
ranked AS (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY category_id
               ORDER BY revenue DESC
           ) AS rank
    FROM product_revenue
)
SELECT * FROM ranked WHERE rank <= 5;

/* Then discuss optimization:

   "For this query at scale, I would consider:

   1. INDEXING:
      - Index on order_items(product_id) for the join
      - Potentially a covering index:
        order_items(product_id) INCLUDE (quantity, unit_price)

   2. MATERIALIZATION:
      - If this runs frequently, a materialized view of product_revenue,
        refreshed periodically

   3. PARTITIONING:
      - If order_items is very large, partition by date and only
        aggregate recent periods

   4. APPROXIMATE:
      - For dashboards, could use sampling for near-real-time estimates
        instead of exact calculations

   5. OBSERVE THE PLAN:
      - Run EXPLAIN ANALYZE to verify the join strategy
      - Ensure the window function doesn't spill to disk
      - Check whether a category_id index helps the partitioning"
*/
```

First write a correct, readable solution. Premature optimization distracts from demonstrating SQL knowledge. Mention optimizations after correctness is established, or when specifically asked. 'Works correctly' beats 'faster but wrong' every time.
You've completed the SQL Query Writing module with a strong foundation in query optimization—the final piece that transforms SQL knowledge into practical database expertise.
Key Takeaways:
- Databases turn declarative SQL into physical plans; EXPLAIN and EXPLAIN ANALYZE show the plan the optimizer chose and whether its row estimates match reality.
- Keep statistics fresh so the cost-based optimizer can make good decisions.
- Indexes are the highest-leverage optimization: write predicates the index can use (no functions on indexed columns, no implicit conversions) and apply composite, covering, and partial indexes deliberately.
- Avoid anti-patterns such as SELECT *, leading-wildcard LIKE, and filtering in HAVING when WHERE would do.
- Filter early, index join columns, and prefer set-based rewrites (window functions, EXISTS) over row-by-row correlated subqueries.
- In interviews, establish correctness first, then discuss indexing, materialization, and the execution plan.
Module Complete:
You've now mastered the full spectrum of SQL query writing—from complex query construction and multi-table joins, through subqueries and window functions, to query optimization. These skills prepare you to write sophisticated, efficient SQL in any technical interview or production environment.
Congratulations! You now possess the complete SQL toolkit for technical interviews: complex queries, joins, subqueries, window functions, and optimization awareness. This combination of correctness and performance understanding is exactly what interviewers seek in senior database engineers.