Correct SQL that runs slowly is often worthless in production. Query optimization transforms queries that take minutes into queries that complete in milliseconds—the difference between a responsive application and a frustrated user base.
In technical interviews, demonstrating optimization awareness elevates you from 'someone who knows SQL' to 'someone who understands databases.' When asked to write a query, proactively discussing its performance characteristics and potential optimizations shows the depth of understanding that senior engineers possess.
By the end of this page, you will be able to:
- Understand how databases execute queries and read execution plans
- Write queries that let the optimizer work efficiently
- Apply indexing strategies that dramatically improve performance
- Avoid common anti-patterns that kill query speed
- Discuss optimization trade-offs intelligently in interviews
Before optimizing queries, you must understand how databases execute them. Every SQL statement passes through a pipeline that transforms your declarative request into physical operations.
The Query Processing Pipeline:
| Stage | Purpose | Output | Optimization Opportunity |
|---|---|---|---|
| Parsing | Validate SQL syntax | Parse tree | None (syntax is syntax) |
| Binding | Resolve table/column names | Bound tree | None (schema-defined) |
| Optimization | Find efficient execution plan | Query plan | Statistics, hints, schema design |
| Execution | Run the chosen plan | Result set | Indexes, memory, parallelism |
The Query Optimizer:
The optimizer's job is to find an efficient execution plan among potentially millions of alternatives. It considers factors such as table and index sizes, column value distributions, available indexes, and possible join orders and access methods.
Cost-Based Optimization:
Modern optimizers are cost-based—they estimate the 'cost' of each possible plan and choose the lowest-cost option. Cost is typically measured in estimated I/O and CPU operations. The optimizer relies on statistics about table sizes, column value distributions, and index characteristics.
Stale statistics cause the optimizer to make poor decisions. After significant data changes (bulk loads, mass deletes), update statistics. In PostgreSQL: ANALYZE table_name. In MySQL: ANALYZE TABLE table_name. In SQL Server: UPDATE STATISTICS table_name.
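For example, a minimal PostgreSQL sketch (using the orders table referenced throughout this page) for refreshing and inspecting statistics after a bulk load:

```sql
-- Refresh planner statistics after a bulk load
ANALYZE orders;

-- Check when statistics were last gathered (manually or by autovacuum)
SELECT relname, last_analyze, last_autoanalyze
FROM pg_stat_user_tables
WHERE relname = 'orders';

-- Optionally raise the sampling target for a heavily skewed column, then re-analyze
ALTER TABLE orders ALTER COLUMN customer_id SET STATISTICS 500;
ANALYZE orders;
```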
Execution plans reveal exactly how the database will execute your query. Learning to read them is essential for optimization. Every major database provides EXPLAIN functionality.
```sql
-- PostgreSQL: EXPLAIN with ANALYZE for actual execution stats
EXPLAIN ANALYZE
SELECT c.name, o.order_id, o.total
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
WHERE o.order_date >= '2024-01-01'
ORDER BY o.total DESC
LIMIT 10;

-- PostgreSQL: EXPLAIN with all options
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT * FROM orders WHERE customer_id = 1001;

-- MySQL: EXPLAIN with FORMAT
EXPLAIN FORMAT=JSON
SELECT * FROM orders WHERE customer_id = 1001;

-- SQL Server: Actual Execution Plan
SET STATISTICS IO ON;
SET STATISTICS TIME ON;
SELECT * FROM orders WHERE customer_id = 1001;
```

Key Plan Elements to Examine:
| Component | What It Tells You | Warning Signs |
|---|---|---|
| Seq Scan / Table Scan | Reading entire table | Large tables without filter |
| Index Scan | Reading via index | OK for small result sets |
| Index Only Scan | All data from index | Optimal—no table access |
| Nested Loop | Row-by-row join | Inner side lacks index |
| Hash Join | Build hash, probe | Insufficient memory |
| Sort | Ordering operation | Large data without index |
| Rows (estimated vs actual) | Cardinality estimation | Large discrepancy = stale stats |
```sql
/* Example PostgreSQL EXPLAIN ANALYZE output:

Limit  (cost=1000.43..1000.46 rows=10 width=52) (actual time=15.234..15.238 rows=10 loops=1)
  ->  Sort  (cost=1000.43..1025.43 rows=10000 width=52) (actual time=15.232..15.234 rows=10 loops=1)
        Sort Key: o.total DESC
        Sort Method: top-N heapsort  Memory: 26kB
        ->  Hash Join  (cost=35.00..766.52 rows=10000 width=52) (actual time=0.435..12.123 rows=10000 loops=1)
              Hash Cond: (o.customer_id = c.customer_id)
              ->  Seq Scan on orders o  (cost=0.00..693.00 rows=10000 width=24) (actual time=0.015..5.234 rows=10000 loops=1)
                    Filter: (order_date >= '2024-01-01'::date)
                    Rows Removed by Filter: 5000
              ->  Hash  (cost=22.00..22.00 rows=1000 width=36) (actual time=0.389..0.390 rows=1000 loops=1)
                    Buckets: 1024  Batches: 1  Memory Usage: 65kB
                    ->  Seq Scan on customers c  (cost=0.00..22.00 rows=1000 width=36) (actual time=0.008..0.189 rows=1000 loops=1)
Planning Time: 0.234 ms
Execution Time: 15.312 ms

KEY OBSERVATIONS:
1. Seq Scan on orders with filter - 5000 rows removed
   -> Consider an index on (order_date) if this filter is common
2. Hash Join chosen - appropriate for this data size
   -> Hash built on the smaller table (customers) - correct
3. Top-N heapsort used for LIMIT + ORDER BY
   -> Efficient for small N, doesn't sort everything
4. Actual rows = estimated rows - good statistics
*/

-- After adding an index, re-run EXPLAIN
CREATE INDEX idx_orders_date ON orders(order_date);
EXPLAIN ANALYZE SELECT * FROM orders WHERE order_date >= '2024-01-01';

/* With the index:
Index Scan using idx_orders_date on orders  (cost=0.29..345.23 rows=10000 width=24)
  (actual time=0.015..2.145 rows=10000 loops=1)
  Index Cond: (order_date >= '2024-01-01'::date)

MUCH FASTER: Uses an index scan instead of seq scan + filter
*/
```

When writing a query in an interview, mention what the execution plan might look like: 'This join would benefit from an index on customer_id. Without it, we'd see nested loops scanning the full orders table for each customer.' This demonstrates practical optimization knowledge.
The single most impactful optimization is ensuring your queries can use appropriate indexes. Understanding how indexes work enables you to write queries that leverage them effectively.
Index Types and Use Cases:
| Index Type | Best For | Limitation | Example |
|---|---|---|---|
| B-Tree (default) | Equality, range, sorting | Not for leading-wildcard patterns | WHERE id = 5, ORDER BY date |
| Hash | Equality only | No range queries | WHERE status = 'active' |
| GIN/Full-Text | Text search, arrays | Write overhead | WHERE doc @@ 'database' |
| GiST | Geometric, range types | Specialized | WHERE point <-> origin < 100 |
| Partial | Filtered subset | Limited applicability | WHERE status = 'pending' |
| Covering | All columns in query | Storage overhead | Includes all SELECT columns |
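The code below focuses on B-tree behavior, the most common case. For the specialized types in the table, here is a brief, hedged sketch of typical PostgreSQL syntax (the documents table and its body column are hypothetical):

```sql
-- Hash index: equality lookups only
CREATE INDEX idx_orders_status_hash ON orders USING HASH (status);

-- GIN full-text index over a computed tsvector
-- (documents/body are hypothetical names used for illustration)
CREATE INDEX idx_docs_fts ON documents
  USING GIN (to_tsvector('english', body));

-- A query that can use the GIN index:
SELECT * FROM documents
WHERE to_tsvector('english', body) @@ to_tsquery('english', 'database');
```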
```sql
-- Index on (customer_id)
-- USES INDEX:
SELECT * FROM orders WHERE customer_id = 1001;
SELECT * FROM orders WHERE customer_id IN (1001, 1002, 1003);
SELECT * FROM orders WHERE customer_id BETWEEN 1000 AND 2000;

-- CANNOT USE INDEX (function on column):
SELECT * FROM orders WHERE YEAR(order_date) = 2024;              -- BAD
SELECT * FROM orders
WHERE order_date >= '2024-01-01' AND order_date < '2025-01-01';  -- GOOD

-- MAY NOT USE INDEX (implicit conversion):
SELECT * FROM orders WHERE customer_id = '1001';  -- May not use the index if types differ

-- Composite index on (customer_id, order_date)
-- USES INDEX (leftmost prefix):
SELECT * FROM orders WHERE customer_id = 1001;
SELECT * FROM orders WHERE customer_id = 1001 AND order_date = '2024-01-15';

-- CANNOT USE INDEX (missing leftmost column):
SELECT * FROM orders WHERE order_date = '2024-01-15';  -- Might skip the index

-- Index for sorts
CREATE INDEX idx_orders_date ON orders(order_date DESC);
-- USES INDEX for:
SELECT * FROM orders ORDER BY order_date DESC LIMIT 10;

-- Covering index (includes all needed columns)
CREATE INDEX idx_orders_covering ON orders(customer_id)
  INCLUDE (order_date, total, status);

-- This query can be satisfied entirely from the index:
SELECT order_date, total, status FROM orders WHERE customer_id = 1001;

-- Partial index (only index the rows you need)
CREATE INDEX idx_pending_orders ON orders(created_at)
  WHERE status = 'pending';

-- Efficient for a common filtered query:
SELECT * FROM orders WHERE status = 'pending' ORDER BY created_at;
```

These patterns prevent index usage: (1) functions on indexed columns (YEAR(date)), (2) implicit type conversions, (3) OR conditions on different columns, (4) LIKE with a leading wildcard ('%text'), (5) NOT IN with nullable subqueries. Rewrite queries to avoid these when possible.
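As an illustration of the last two items, here is a hedged sketch of common rewrites (table and column names follow the schema used on this page; adapt as needed):

```sql
-- LIKE with a leading wildcard cannot use a B-tree index:
SELECT * FROM products WHERE product_name LIKE '%phone%';
-- A prefix pattern can:
SELECT * FROM products WHERE product_name LIKE 'phone%';
-- For true substring search, consider a full-text or trigram index instead.

-- NOT IN returns no rows if the subquery yields any NULL,
-- and it often optimizes poorly; NOT EXISTS avoids both issues:
SELECT c.*
FROM customers c
WHERE NOT EXISTS (
    SELECT 1 FROM orders o WHERE o.customer_id = c.customer_id
);
```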
Certain query patterns consistently cause performance problems. Recognizing and avoiding these anti-patterns is fundamental to writing efficient SQL.
**Anti-Pattern: `SELECT *`**
Retrieving all columns when you only need a few wastes bandwidth and memory, and it prevents covering-index optimization:
```sql
-- ANTI-PATTERN: Fetching all columns
SELECT * FROM orders WHERE customer_id = 1001;
-- Fetches: order_id, customer_id, order_date, total,
--          status, shipping_address, billing_address,
--          notes, created_at, updated_at, ...

-- BETTER: Only the needed columns
SELECT order_id, order_date, total, status
FROM orders WHERE customer_id = 1001;

-- BEST: With a covering index, no table access is needed
CREATE INDEX idx_orders_customer_covering
  ON orders(customer_id)
  INCLUDE (order_id, order_date, total, status);

SELECT order_id, order_date, total, status
FROM orders WHERE customer_id = 1001;
-- Index-only scan: 10x faster
```

Joins are often the most expensive operations in complex queries. Understanding join execution and applying optimization strategies significantly impacts performance.
Join Algorithm Selection:
| Algorithm | When Chosen | Optimization Strategy |
|---|---|---|
| Nested Loop | Small outer table, indexed inner | Index on inner join column |
| Hash Join | Large unsorted tables, equi-join | Sufficient work_mem/memory |
| Sort-Merge | Inputs already sorted, or both inputs very large | Pre-sorted data, cluster indexes |
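You normally let the planner choose the algorithm, but when diagnosing a plan it can help to compare strategies directly. A hedged sketch using PostgreSQL's session-level enable_* settings (diagnostic only, not something to leave on in production):

```sql
-- Temporarily disable nested loops in this transaction, compare the plan, then discard.
BEGIN;
SET LOCAL enable_nestloop = off;
EXPLAIN ANALYZE
SELECT c.name, o.order_id
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id;
ROLLBACK;  -- the SET LOCAL setting ends with the transaction
```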
```sql
-- Essential: Index on foreign keys
CREATE INDEX idx_orders_customer ON orders(customer_id);
CREATE INDEX idx_order_items_order ON order_items(order_id);
CREATE INDEX idx_order_items_product ON order_items(product_id);

-- Filter before joining (reduce the join input)
-- WORSE: Filter after the join
SELECT c.name, o.order_id
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
WHERE o.order_date >= '2024-01-01';

-- The optimizer usually pushes this down, but being explicit helps:
-- EXPLICIT: CTE filter before the join
WITH recent_orders AS (
    SELECT * FROM orders WHERE order_date >= '2024-01-01'
)
SELECT c.name, ro.order_id
FROM customers c
JOIN recent_orders ro ON c.customer_id = ro.customer_id;

-- Avoid joining large intermediate results
-- PROBLEMATIC: Large join, then filter
SELECT COUNT(*)
FROM orders o
JOIN order_items oi ON o.order_id = oi.order_id
JOIN products p ON oi.product_id = p.product_id
WHERE p.category_id = 5;

-- BETTER: Filter early, join less
SELECT COUNT(*)
FROM orders o
JOIN order_items oi ON o.order_id = oi.order_id
WHERE oi.product_id IN (
    SELECT product_id FROM products WHERE category_id = 5
);

-- Or with a CTE for clarity:
WITH category_products AS (
    SELECT product_id FROM products WHERE category_id = 5
)
SELECT COUNT(*)
FROM orders o
JOIN order_items oi ON o.order_id = oi.order_id
JOIN category_products cp ON oi.product_id = cp.product_id;

-- Avoid unnecessary joins
-- UNNECESSARY: Join just to filter
SELECT o.*
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
WHERE c.region = 'North';

-- BETTER: Subquery or semi-join
SELECT * FROM orders
WHERE customer_id IN (
    SELECT customer_id FROM customers WHERE region = 'North'
);

-- Or EXISTS:
SELECT o.*
FROM orders o
WHERE EXISTS (
    SELECT 1 FROM customers c
    WHERE c.customer_id = o.customer_id AND c.region = 'North'
);
```

While modern optimizers usually find good join orders, you can help by: (1) putting the most selective filters first, (2) joining smaller results before larger ones (conceptually), (3) using CTEs to materialize intermediate results when beneficial.
Subqueries and CTEs have different performance characteristics. Understanding when each is preferable helps you write efficient complex queries.
```sql
-- Correlated subquery: can be slow (row-by-row)
SELECT *
FROM products p
WHERE p.unit_price > (
    SELECT AVG(unit_price)
    FROM products
    WHERE category_id = p.category_id   -- Correlation
);

-- REWRITE as a window function (single pass):
SELECT *
FROM (
    SELECT *,
           AVG(unit_price) OVER (PARTITION BY category_id) AS cat_avg
    FROM products
) p
WHERE unit_price > cat_avg;

-- REWRITE as a JOIN (explicit hash join):
SELECT p.*
FROM products p
JOIN (
    SELECT category_id, AVG(unit_price) AS cat_avg
    FROM products
    GROUP BY category_id
) cat_stats ON p.category_id = cat_stats.category_id
WHERE p.unit_price > cat_stats.cat_avg;

-- EXISTS vs IN performance
-- EXISTS short-circuits: stops at the first match
SELECT c.*
FROM customers c
WHERE EXISTS (
    SELECT 1 FROM orders o WHERE o.customer_id = c.customer_id
);

-- IN evaluates the entire subquery
SELECT * FROM customers
WHERE customer_id IN (SELECT customer_id FROM orders);
-- Often optimized to the same plan, but EXISTS is explicit about intent

-- CTE materialization (PostgreSQL 12+)
-- Force materialization (compute once, reuse):
WITH expensive_calc AS MATERIALIZED (
    SELECT customer_id,
           MAX(complex_calculation(data)) AS result  -- placeholder expensive function
    FROM big_table
    GROUP BY customer_id
)
SELECT *
FROM expensive_calc c1
JOIN expensive_calc c2 ON c1.result = c2.result;

-- Inline (evaluate at each reference):
WITH simple_filter AS NOT MATERIALIZED (
    SELECT * FROM orders WHERE status = 'active'
)
SELECT * FROM simple_filter WHERE customer_id = 1001;
-- Allows index usage on the combined predicates

-- Scalar subquery caching
-- Scalar subqueries with the same input are often cached:
SELECT customer_id,
       (SELECT MAX(order_date) FROM orders
        WHERE customer_id = c.customer_id) AS last_order
FROM customers c;
-- But a single-pass window function is usually faster:
SELECT DISTINCT ON (customer_id)
       customer_id,
       MAX(order_date) OVER (PARTITION BY customer_id) AS last_order
FROM orders;
```

Aggregation and sorting are memory-intensive operations. Optimizing them is crucial for queries on large datasets.
```sql
-- Index for ORDER BY
CREATE INDEX idx_orders_date_desc ON orders(order_date DESC);

-- The query uses the index to avoid sorting:
SELECT * FROM orders ORDER BY order_date DESC LIMIT 10;
-- "Index Scan" instead of "Sort + Seq Scan"

-- LIMIT with ORDER BY optimization
-- GOOD: The database can stop after N rows
SELECT * FROM orders ORDER BY order_date DESC LIMIT 10;

-- PROBLEMATIC: Must process all rows before limiting
SELECT * FROM orders ORDER BY random() LIMIT 10;  -- No early termination

-- Approximate distinct counts for large data
-- SLOW: Exact count
SELECT COUNT(DISTINCT customer_id) FROM orders;  -- Processes all rows

-- FAST: Approximate count
-- (APPROX_COUNT_DISTINCT in SQL Server 2019+; PostgreSQL via extensions such as hll)
SELECT APPROX_COUNT_DISTINCT(customer_id) FROM orders;

-- FAST: Sampling for rough estimates
SELECT COUNT(DISTINCT customer_id) * 100 AS estimated_count
FROM orders TABLESAMPLE SYSTEM(1);  -- 1% sample; rough estimate only

-- GROUP BY optimization with an index
CREATE INDEX idx_orders_customer ON orders(customer_id);

-- Can use the index for GROUP BY:
SELECT customer_id, COUNT(*) FROM orders GROUP BY customer_id;

-- Partial aggregation for distributed queries
-- Pre-aggregate in a subquery
SELECT region, SUM(daily_total) AS monthly_total
FROM (
    SELECT region, date, SUM(amount) AS daily_total
    FROM transactions
    GROUP BY region, date
) daily
WHERE date >= '2024-01-01' AND date < '2024-02-01'
GROUP BY region;

-- Avoid DISTINCT when GROUP BY suffices
-- SLOWER:
SELECT DISTINCT customer_id FROM orders;

-- FASTER (often):
SELECT customer_id FROM orders GROUP BY customer_id;

-- HAVING vs WHERE
-- GOOD: Filter rows before grouping (WHERE)
SELECT customer_id, SUM(total)
FROM orders
WHERE order_date >= '2024-01-01'   -- Filters rows first
GROUP BY customer_id
HAVING SUM(total) > 1000;          -- Filters groups after

-- BAD: Filtering in HAVING when WHERE works
SELECT customer_id, SUM(total)
FROM orders
GROUP BY customer_id
HAVING SUM(total) > 1000 AND order_date >= '2024-01-01';  -- Error or wrong
```

In interviews, optimization discussions demonstrate depth. Here's how to approach them effectively:
When to Bring Up Optimization:
```sql
/* Interview Question:
   "Find the top 5 products by revenue for each category"

   First, write a correct solution: */

WITH product_revenue AS (
    SELECT
        p.product_id,
        p.product_name,
        p.category_id,
        SUM(oi.quantity * oi.unit_price) AS revenue
    FROM products p
    JOIN order_items oi ON p.product_id = oi.product_id
    GROUP BY p.product_id, p.product_name, p.category_id
),
ranked AS (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY category_id
               ORDER BY revenue DESC
           ) AS rank
    FROM product_revenue
)
SELECT * FROM ranked WHERE rank <= 5;

/* Then discuss optimization:

   "For this query at scale, I would consider:

   1. INDEXING:
      - Index on order_items(product_id) for the join
      - Potentially a covering index:
        order_items(product_id) INCLUDE (quantity, unit_price)

   2. MATERIALIZATION:
      - If this runs frequently, a materialized view of product_revenue,
        refreshed periodically

   3. PARTITIONING:
      - If order_items is very large, partition by date and only
        aggregate recent periods

   4. APPROXIMATE:
      - For dashboards, could use sampling for near-real-time estimates
        instead of exact calculations

   5. OBSERVE THE PLAN:
      - Run EXPLAIN ANALYZE to verify the join strategy
      - Ensure the window function doesn't spill to disk
      - Check whether a category_id index helps the partitioning"
*/
```

First write a correct, readable solution. Premature optimization distracts from demonstrating SQL knowledge. Mention optimizations after correctness is established, or when specifically asked. 'Works correctly' beats 'faster but wrong' every time.
You've completed the SQL Query Writing module with a strong foundation in query optimization—the final piece that transforms SQL knowledge into practical database expertise.
Key Takeaways:
- Databases turn declarative SQL into physical plans; EXPLAIN and EXPLAIN ANALYZE show the plan the optimizer chose and whether its row estimates match reality.
- Keep statistics fresh so the cost-based optimizer can make good decisions.
- Indexes are the highest-leverage optimization: write predicates the index can use (no functions on indexed columns, no implicit conversions) and apply composite, covering, and partial indexes deliberately.
- Avoid anti-patterns such as SELECT *, leading-wildcard LIKE, and filtering in HAVING when WHERE would do.
- Filter early, index join columns, and prefer set-based rewrites (window functions, EXISTS) over row-by-row correlated subqueries.
- In interviews, establish correctness first, then discuss indexing, materialization, and the execution plan.
Module Complete:
You've now mastered the full spectrum of SQL query writing—from complex query construction and multi-table joins, through subqueries and window functions, to query optimization. These skills prepare you to write sophisticated, efficient SQL in any technical interview or production environment.
Congratulations! You now possess the complete SQL toolkit for technical interviews: complex queries, joins, subqueries, window functions, and optimization awareness. This combination of correctness and performance understanding is exactly what interviewers seek in senior database engineers.