Every experienced database professional has war stories—queries that brought production to its knees, optimizations that made things worse, and silent time bombs that exploded at the worst possible moment.
These stories share common themes. The same pitfalls trap developer after developer, year after year. By learning to recognize these patterns before they cause outages, you can save yourself and your organization significant pain.
This page catalogs the most common performance pitfalls, explains why they occur, and provides actionable strategies for detection and prevention. Consider this your field guide to database performance antipatterns.
By the end of this page, you will recognize and prevent: N+1 query problems; unbounded queries missing pagination; lock contention patterns; statistics-related plan instability; over- and under-indexing; query plan regression triggers; and monitoring blind spots that hide problems until they explode.
The N+1 query problem is perhaps the most common performance issue in applications using ORMs. It occurs when code fetches a list of items (1 query), then fetches related data for each item individually (N queries).
```python
# The Problem (Python/SQLAlchemy style pseudocode)

# Query 1: Fetch 100 orders
orders = db.query("SELECT * FROM orders WHERE status = 'pending' LIMIT 100")

# Queries 2-101: Fetch customer for EACH order
for order in orders:
    customer = db.query(f"SELECT * FROM customers WHERE id = {order.customer_id}")
    print(f"{customer.name}: {order.total}")

# Result: 101 database round trips
# At 5ms per query = 505ms just for network latency

# The Solution: Eager Loading / Join

# Single query with JOIN
results = db.query("""
    SELECT o.*, c.name AS customer_name
    FROM orders o
    JOIN customers c ON o.customer_id = c.id
    WHERE o.status = 'pending'
    LIMIT 100
""")

# Result: 1 database round trip
# At 5ms + slightly more processing = ~8ms total
```

Why N+1 Is So Insidious:

The code looks correct, passes review, and runs fast in development, where N is small. The cost grows linearly with data volume, so the problem often surfaces only in production, long after the code shipped.
Detection Strategies:
| Method | How It Works |
|---|---|
| Query logging | Count distinct query patterns per request |
| APM tools | Flag endpoints with high query counts |
| Database metrics | Monitor queries-per-second spikes |
| Code review | Look for database access in loops |
| ORM debugging | Enable query logging in development |
Most ORMs default to 'lazy loading'—related objects are fetched only when accessed. This is convenient but creates N+1 problems. Consider making eager loading the default and lazy loading opt-in for your critical paths.
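When a JOIN is awkward (for example, when the related rows come from another database or a cache), batching the lookups into a single `IN (...)` query also eliminates the N+1 pattern. A minimal sketch using Python's built-in sqlite3; the schema and names are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO orders VALUES (10, 1, 99.0), (11, 2, 45.0), (12, 1, 12.5);
""")

# Query 1: fetch the orders
orders = conn.execute(
    "SELECT id, customer_id, total FROM orders ORDER BY id"
).fetchall()

# Instead of one query per order, collect the foreign keys
# and resolve them in ONE additional round trip
ids = {customer_id for _, customer_id, _ in orders}
placeholders = ",".join("?" * len(ids))
rows = conn.execute(
    f"SELECT id, name FROM customers WHERE id IN ({placeholders})", list(ids)
).fetchall()
customers = dict(rows)  # id -> name lookup built from a single query

report = [f"{customers[cid]}: {total}" for _, cid, total in orders]
print(report)  # ['Ada: 99.0', 'Grace: 45.0', 'Ada: 12.5']
```

Two round trips total, regardless of how many orders are returned. Most ORMs implement exactly this strategy under names like `selectinload` (SQLAlchemy) or `prefetch_related` (Django).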
An unbounded query is one that can return an unlimited number of rows. These queries work fine with small datasets but become catastrophic as data grows—eventually returning millions of rows, exhausting memory, and timing out.
```sql
-- DANGEROUS: No LIMIT
SELECT * FROM events WHERE created_at > '2024-01-01';
-- Could return 10 rows or 10 million rows

-- DANGEROUS: Admin dashboard showing "all users"
SELECT * FROM users ORDER BY created_at DESC;
-- Works with 1,000 users; OOM with 1,000,000 users

-- DANGEROUS: Export query
SELECT * FROM transactions WHERE merchant_id = 123;
-- High-volume merchants may have millions of transactions

-- DANGEROUS: Count without upper bound
SELECT COUNT(*) FROM logs WHERE level = 'ERROR';
-- Counting billions of rows is slow and hammers I/O

-- SAFE: Always use pagination
SELECT * FROM events
WHERE created_at > '2024-01-01'
ORDER BY created_at
LIMIT 100 OFFSET 0;

-- Or keyset pagination (see below)

-- SAFE: Bounded count with short-circuit
SELECT COUNT(*) FROM (
    SELECT 1 FROM logs WHERE level = 'ERROR' LIMIT 1001
) subq;
-- If > 1000, shows "1000+"; doesn't count the full table
```

The Pagination Performance Trap:
Even paginated queries can be problematic:
```sql
-- OFFSET-based pagination degrades linearly
SELECT * FROM products ORDER BY name LIMIT 20 OFFSET 10000;
-- Database must read and skip 10,000 rows to return 20

-- For page 5000 (OFFSET 100000):
-- Skip 100,000 rows → return 20 rows = terrible performance
```
Solution: Keyset Pagination
```sql
-- Initial page
SELECT * FROM products ORDER BY created_at, id LIMIT 20;

-- Next page (using last row's values)
SELECT * FROM products
WHERE (created_at, id) > ('2024-01-15 09:30:00', 12345)
ORDER BY created_at, id
LIMIT 20;
-- Uses index, no row skipping, constant performance
```
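The keyset approach can be demonstrated end to end with Python's built-in sqlite3 (row-value comparisons require SQLite 3.15+); the products table and its contents are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE products (id INTEGER PRIMARY KEY, created_at TEXT, name TEXT)"
)
conn.executemany(
    "INSERT INTO products (created_at, name) VALUES (?, ?)",
    [(f"2024-01-{d:02d}", f"product-{d}") for d in range(1, 31)],
)

PAGE = 10

def first_page():
    return conn.execute(
        "SELECT id, created_at FROM products ORDER BY created_at, id LIMIT ?",
        (PAGE,),
    ).fetchall()

def next_page(last_created_at, last_id):
    # Row-value comparison: strictly after the last row already returned
    return conn.execute(
        """SELECT id, created_at FROM products
           WHERE (created_at, id) > (?, ?)
           ORDER BY created_at, id
           LIMIT ?""",
        (last_created_at, last_id, PAGE),
    ).fetchall()

page1 = first_page()
last_id, last_ts = page1[-1][0], page1[-1][1]
page2 = next_page(last_ts, last_id)
print(page2[0])  # (11, '2024-01-11')
```

Note that the cursor for the next page is the last row's sort key, not a page number; clients must pass it back, which is why keyset pagination suits "infinite scroll" better than "jump to page 57".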
Always set statement timeouts at the connection or query level: SET statement_timeout = '30s' (PostgreSQL) or MAX_EXECUTION_TIME hint (MySQL). This prevents runaway queries from monopolizing resources. Better a timeout error than a crashed database.
Lock contention occurs when multiple transactions compete for the same resources. What seems like a "slow query" may actually be a fast query waiting for locks held by another transaction.
Common Lock Contention Patterns:
| Pattern | Cause | Impact |
|---|---|---|
| Long-running transactions | Web request with multiple writes held open | Locks held for seconds, blocking others |
| Hot row updates | Counter, queue, or popular item updated frequently | Serialized access, throughput collapse |
| Full table locks | DDL operations, some MySQL ALTER TABLE operations | All access blocked during operation |
| Gap locks (InnoDB) | Range scans in serializable/repeatable read isolation | Phantom rows prevented, but index ranges are locked, blocking inserts |
| Unindexed foreign keys (Oracle) | Parent-key UPDATE/DELETE takes a full table lock on the child table | Cascading lock escalation |
```sql
-- PostgreSQL: Find blocking queries
SELECT
    blocked.pid AS blocked_pid,
    blocked.query AS blocked_query,
    blocking.pid AS blocking_pid,
    blocking.query AS blocking_query,
    blocking.state,
    NOW() - blocking.query_start AS blocking_duration
FROM pg_catalog.pg_locks blocked_locks
JOIN pg_catalog.pg_stat_activity blocked
    ON blocked.pid = blocked_locks.pid
JOIN pg_catalog.pg_locks blocking_locks
    ON blocking_locks.locktype = blocked_locks.locktype
    AND blocking_locks.database IS NOT DISTINCT FROM blocked_locks.database
    AND blocking_locks.relation IS NOT DISTINCT FROM blocked_locks.relation
    AND blocking_locks.page IS NOT DISTINCT FROM blocked_locks.page
    AND blocking_locks.tuple IS NOT DISTINCT FROM blocked_locks.tuple
    AND blocking_locks.transactionid IS NOT DISTINCT FROM blocked_locks.transactionid
    AND blocking_locks.classid IS NOT DISTINCT FROM blocked_locks.classid
    AND blocking_locks.objid IS NOT DISTINCT FROM blocked_locks.objid
    AND blocking_locks.objsubid IS NOT DISTINCT FROM blocked_locks.objsubid
    AND blocking_locks.pid != blocked_locks.pid
JOIN pg_catalog.pg_stat_activity blocking
    ON blocking.pid = blocking_locks.pid
WHERE NOT blocked_locks.granted;

-- MySQL: Find locks and waits
SELECT * FROM information_schema.innodb_lock_waits;  -- MySQL 5.7
SELECT * FROM performance_schema.data_lock_waits;    -- MySQL 8.0+

-- Prevention: Keep transactions short

-- BAD: Open transaction during external call
BEGIN;
INSERT INTO orders ...;
-- HTTP call to payment processor (3 seconds)
UPDATE orders SET payment_confirmed = true ...;
COMMIT;
-- Locks held for 3+ seconds

-- GOOD: Defer locks until needed
-- HTTP call to payment processor (3 seconds)
BEGIN;
INSERT INTO orders ...;
UPDATE orders SET payment_confirmed = true ...;
COMMIT;
-- Locks held for milliseconds
```

Databases detect and break deadlocks by aborting one transaction; the application should catch the error and retry. Occasional deadlocks are normal, but frequent deadlocks indicate a design problem, such as inconsistent lock ordering or unnecessary contention.
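The catch-and-retry logic can be sketched as a small helper. `DeadlockError` here stands in for whatever your driver actually raises (the exception class and error code vary by database and client library):

```python
import time

class DeadlockError(Exception):
    """Stand-in for the driver-specific error, e.g. MySQL error 1213
    or PostgreSQL SQLSTATE 40P01."""

def run_with_retry(txn_fn, max_attempts=3, base_delay=0.01):
    """Retry a transaction when the database aborts it as a deadlock victim."""
    for attempt in range(1, max_attempts + 1):
        try:
            return txn_fn()
        except DeadlockError:
            if attempt == max_attempts:
                raise
            # Exponential backoff so retries don't immediately re-collide
            time.sleep(base_delay * 2 ** (attempt - 1))

# Demo: a transaction picked as the deadlock victim twice, then succeeding
attempts = []
def flaky_transaction():
    attempts.append(1)
    if len(attempts) < 3:
        raise DeadlockError("deadlock detected")
    return "committed"

result = run_with_retry(flaky_transaction)
print(result, len(attempts))  # committed 3
```

The key design point: `txn_fn` must encapsulate the *entire* transaction, so a retry replays all of its statements, not just the one that failed.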
The query optimizer relies on statistics to estimate row counts and choose execution plans. Stale, missing, or misleading statistics cause the optimizer to make poor decisions—sometimes catastrophically poor.
```sql
-- PostgreSQL: Check statistics freshness
SELECT
    schemaname,
    relname,
    last_vacuum,
    last_autovacuum,
    last_analyze,
    last_autoanalyze,
    n_live_tup,
    n_dead_tup
FROM pg_stat_user_tables
ORDER BY last_analyze NULLS FIRST;

-- PostgreSQL: Manually update statistics
ANALYZE orders;   -- Specific table
ANALYZE;          -- All tables (caution in production)

-- PostgreSQL: Extended statistics for correlated columns
CREATE STATISTICS stats_customer_country_region
    ON country, region FROM customers;
ANALYZE customers;

-- MySQL: Check statistics freshness
SELECT table_name, n_rows, last_update
FROM mysql.innodb_table_stats
WHERE database_name = 'your_db';

-- MySQL: Update statistics
ANALYZE TABLE orders;

-- SQL Server: Update statistics
UPDATE STATISTICS orders;
-- Or with full scan for accuracy:
UPDATE STATISTICS orders WITH FULLSCAN;
```

After bulk loading data (ETL, migrations), statistics for the new data don't exist yet. The optimizer may assume the table is nearly empty and make disastrous plan choices. Always run ANALYZE immediately after bulk operations; some databases trigger this automatically, but verify.
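SQLite makes the "ANALYZE after bulk load" habit easy to see: its optimizer statistics live in the `sqlite_stat1` table, which exists only after ANALYZE has run. A small sketch with illustrative table and index names:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT);
    CREATE INDEX idx_orders_status ON orders (status);
""")

# Bulk load: the optimizer has no distribution statistics for this data yet
conn.executemany(
    "INSERT INTO orders (status) VALUES (?)",
    [("pending" if i % 10 == 0 else "done",) for i in range(1000)],
)

# Refresh statistics immediately after the bulk operation
conn.execute("ANALYZE")

# sqlite_stat1 now holds one row per analyzed index
rows_after = conn.execute(
    "SELECT tbl, idx FROM sqlite_stat1 WHERE tbl = 'orders'"
).fetchall()
print(rows_after)
```

The same discipline applies to the server databases above: the commands differ (`ANALYZE`, `ANALYZE TABLE`, `UPDATE STATISTICS`), but the failure mode of skipping the step is identical.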
SELECT * is convenient but creates performance and maintainability problems. Understanding when it's acceptable and when it's harmful helps avoid common pitfalls.
```sql
-- Table: users (id, name, email, bio TEXT, profile_photo BYTEA)

-- PROBLEM: Fetching binary data unnecessarily
SELECT * FROM users WHERE id = 123;
-- Returns profile_photo (1MB) even if you only need name

-- SOLUTION: Explicit columns
SELECT id, name, email FROM users WHERE id = 123;
-- Returns only needed data; can use covering index

-- PROBLEM: Covering index not used
CREATE INDEX idx_users_email_name ON users (email, name);

SELECT * FROM users WHERE email = 'test@example.com';
-- Cannot use index-only scan; must access table for all columns

SELECT email, name FROM users WHERE email = 'test@example.com';
-- Index-only scan possible; much faster

-- ACCEPTABLE uses of SELECT *:
-- 1. Column validation/exploration in development
-- 2. When you genuinely need all columns AND the table is small
-- 3. In EXISTS subqueries (SELECT 1 FROM ... is equivalent)
-- 4. Migrating data: INSERT INTO new_table SELECT * FROM old_table
```

The WIDTH Problem:
Row width (total bytes per row) directly affects performance: for tables with TEXT/BLOB columns or many wide VARCHAR columns, SELECT * can return 10x-100x more bytes than necessary, inflating network transfer, result buffering, and cache pressure.
Most ORMs support projection—fetching only specific columns. Django's .only() and .defer(), Rails' .select(), SQLAlchemy's load_only(), Entity Framework's .Select(). Use these for queries where you don't need all columns, especially for list views.
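The width difference is easy to measure. A small sqlite3 sketch with an illustrative users table holding a 1 MB profile photo:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, email TEXT,"
    " profile_photo BLOB)"
)
conn.execute(
    "INSERT INTO users VALUES (123, 'Ada', 'ada@example.com', ?)",
    (b"\x00" * 1_000_000,),  # simulate a 1 MB photo
)

# SELECT * drags the photo along even though we only need name/email
wide = conn.execute("SELECT * FROM users WHERE id = 123").fetchone()
narrow = conn.execute(
    "SELECT id, name, email FROM users WHERE id = 123"
).fetchone()

wide_bytes = len(wide[3])                       # the photo dominates the row
narrow_bytes = sum(len(str(v)) for v in narrow) # a few dozen bytes
print(wide_bytes, narrow_bytes)
```

In a real client/server setup, those extra bytes also cross the network and sit in connection buffers, so the gap hurts more than this in-process demo suggests.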
Indexing problems manifest in two opposite ways: missing indexes cause full scans on large tables, while excessive indexes slow writes and waste storage. Finding the right balance requires understanding actual query patterns.
Detecting Missing Indexes:
```sql
-- PostgreSQL: Find sequential scans on large tables
SELECT
    schemaname,
    relname,
    seq_scan,
    seq_tup_read,
    idx_scan,
    pg_size_pretty(pg_relation_size(relid)) AS size
FROM pg_stat_user_tables
WHERE seq_scan > 0
  AND pg_relation_size(relid) > 10000000  -- Tables > 10MB
ORDER BY seq_tup_read DESC;
-- High seq_tup_read on large tables indicates potential missing indexes

-- PostgreSQL: Find slow queries (requires pg_stat_statements)
SELECT
    query,
    calls,
    total_exec_time / 1000 AS total_seconds,
    mean_exec_time AS avg_ms,
    rows
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 20;

-- MySQL: Use Performance Schema
SELECT
    digest_text,
    count_star,
    avg_timer_wait / 1000000000 AS avg_ms,
    sum_rows_examined,
    sum_rows_sent
FROM performance_schema.events_statements_summary_by_digest
ORDER BY avg_timer_wait DESC
LIMIT 20;

-- SQL Server: Missing Index DMVs (built-in recommendations)
SELECT
    migs.avg_user_impact,
    mid.statement,
    mid.equality_columns,
    mid.inequality_columns,
    mid.included_columns
FROM sys.dm_db_missing_index_group_stats migs
JOIN sys.dm_db_missing_index_groups mig
    ON migs.group_handle = mig.index_group_handle
JOIN sys.dm_db_missing_index_details mid
    ON mig.index_handle = mid.index_handle
ORDER BY migs.avg_user_impact * migs.user_seeks DESC;
```

Signs of Over-Indexing:
| Symptom | Cause |
|---|---|
| Slow INSERT/UPDATE | Every index must be updated |
| Large storage usage | Indexes may exceed table size |
| Many unused indexes | Indexes created but never scanned |
| Duplicate indexes | idx(a) and idx(a,b) both exist |
| Competing indexes | Multiple indexes on same columns in different orders |
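The duplicate-index check from the table above is mechanical enough to automate. A sketch under one assumption: you have already extracted each index's column list from your catalog (e.g. pg_index, information_schema.statistics) into an ordered tuple. An index is redundant when its columns are a leading prefix of another index on the same table:

```python
def find_redundant_indexes(indexes):
    """Flag an index whose column list is a leading prefix of another
    index's columns (e.g. idx(a) is covered by idx(a, b))."""
    redundant = []
    for name, cols in indexes.items():
        for other, other_cols in indexes.items():
            if (name != other
                    and len(cols) < len(other_cols)
                    and tuple(other_cols[:len(cols)]) == tuple(cols)):
                redundant.append((name, other))
    return redundant

# Hypothetical index definitions for one table
indexes = {
    "idx_orders_customer": ("customer_id",),
    "idx_orders_customer_status": ("customer_id", "status"),
    "idx_orders_created": ("created_at",),
}
print(find_redundant_indexes(indexes))
# [('idx_orders_customer', 'idx_orders_customer_status')]
```

One caveat before dropping anything flagged: the narrower index may still be deliberately kept for its smaller size or for a UNIQUE constraint, so treat the output as candidates for review, not an automatic drop list.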
SQL Server and Oracle suggest 'missing indexes' based on query workload. These suggestions are often good starting points but can recommend redundant indexes or indexes only useful for rare queries. Always evaluate whether the read improvement justifies the write overhead.
A query that was fast yesterday becomes slow today—not because the query changed, but because the execution plan changed. This is a "plan regression," and it's one of the most frustrating performance problems to diagnose.
Prevention and Mitigation:
| Strategy | Description |
|---|---|
| Plan baselines | SQL Server, Oracle: Lock known-good plans |
| Query store | Capture historical plans for comparison |
| Monitoring | Alert on query execution time increases |
| Controlled deploys | Deploy schema/index changes during low traffic |
| Testing | Load test with production-like data before changes |
| Rollback capability | Be prepared to revert statistics/settings |
```sql
-- SQL Server: Query Store for plan history
SELECT
    q.query_id,
    qt.query_sql_text,
    rs.avg_duration,
    p.query_plan,
    rs.first_execution_time,
    rs.last_execution_time
FROM sys.query_store_query q
JOIN sys.query_store_query_text qt
    ON q.query_text_id = qt.query_text_id
JOIN sys.query_store_plan p
    ON q.query_id = p.query_id
JOIN sys.query_store_runtime_stats rs
    ON p.plan_id = rs.plan_id
WHERE qt.query_sql_text LIKE '%orders%'
ORDER BY rs.avg_duration DESC;

-- PostgreSQL: auto_explain for continuous plan logging
-- In postgresql.conf:
--   shared_preload_libraries = 'auto_explain'
--   auto_explain.log_min_duration = '1s'
--   auto_explain.log_analyze = true
-- Then check logs for plan changes over time

-- Force a specific plan (PostgreSQL - last resort)
-- First, find the good plan's structure, then use:
SET enable_seqscan = off;
SET enable_nestloop = off;
-- etc., to push the optimizer toward the good plan
-- Better: Fix the underlying cause (statistics, indexes)
```

While databases offer plan hints and plan forcing, these are maintenance burdens: the forced plan may become suboptimal as data changes. Focus on fixing root causes (better indexes, fresh statistics, or query rewrites) and use plan forcing only for temporary stabilization while addressing the real issue.
Many performance problems go undetected until they cause outages because monitoring focuses on the wrong metrics or misses critical signals. Understanding common blind spots helps you build comprehensive observability.
| Blind Spot | Why It's Hidden | How to Detect |
|---|---|---|
| Slow queries that don't time out | Average response time looks fine; P99 is terrible | Monitor P95/P99 latency, not just average |
| Lock wait time | Query itself is fast; wait time isn't measured | Monitor lock_time, time_waiting separately |
| Connection exhaustion | Queries work; new connections fail | Monitor connection pool usage, not just errors |
| Disk space exhaustion | Sudden; everything breaks at once | Alert at 70% disk usage, not 95% |
| Replication lag | Reads work; stale data returned | Monitor seconds_behind_master / replication lag |
| Checkpoint impact | Periodic slowdowns during writes | Correlate checkpoint events with latency spikes |
| Index bloat | Gradual degradation, hard to attribute | Track index size ratio to row count over time |
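The first blind spot above deserves emphasis: an average hides tail latency entirely. A quick sketch using nearest-rank percentiles over a hypothetical sample of request latencies:

```python
import math

# 100 requests: 95 fast ones, 5 slow ones (e.g. hitting a lock or cold cache)
latencies_ms = [20] * 95 + [2000] * 5

def percentile(samples, p):
    """Nearest-rank percentile: smallest value with at least p% of
    samples at or below it."""
    ordered = sorted(samples)
    k = max(0, math.ceil(p / 100 * len(ordered)) - 1)
    return ordered[k]

avg = sum(latencies_ms) / len(latencies_ms)
p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
print(avg, p50, p99)  # 119.0 20 2000
```

The average (119ms) and median (20ms) both look acceptable, while the p99 (2000ms) reveals that one request in twenty is timing out. This is why the table recommends alerting on P95/P99, not the mean.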
You now have a comprehensive understanding of the most common database performance pitfalls and how to detect, prevent, and resolve them. These patterns recur across all production environments—recognizing them quickly is a core skill for any database practitioner.
Congratulations on completing the Query Optimization module! You've learned the fundamental skills that separate database novices from experts—skills that directly impact application performance and user experience.
Your Next Steps:
Query optimization is a skill developed through practice. Apply these concepts to your own queries.
The techniques in this module form the foundation for all database performance work. As you encounter more complex scenarios, you'll build on these fundamentals.
You have completed the Query Optimization module within SQL Databases Deep Dive. You now possess the knowledge to diagnose and resolve the majority of SQL query performance problems encountered in production systems.