When a database-backed application becomes slow, the natural instinct is to start optimizing: add an index, increase memory, upgrade hardware. But optimization without diagnosis is gambling—you might fix the wrong thing, waste resources, or even make performance worse. Bottleneck identification is the disciplined process of locating exactly where performance is constrained before attempting solutions.
A bottleneck is any point in the system where throughput is constrained or latency is introduced. Like a traffic jam at a highway merge, the bottleneck limits overall system capacity regardless of how fast other components operate. Identifying the true bottleneck—as opposed to symptoms or secondary effects—is the essential first step in performance improvement.
By the end of this page, you will understand systematic approaches to bottleneck identification, learn to distinguish symptoms from root causes, master the use of wait analysis and execution plan examination, and develop the diagnostic mindset essential for database performance work.
The Theory of Constraints, developed by Eliyahu Goldratt, provides a fundamental insight: every system has exactly one constraint (bottleneck) that limits its throughput at any given time. Improving performance at non-bottleneck points yields no system-level improvement—you simply shift the queue elsewhere.
In database systems, this principle manifests clearly: a single unindexed query, a saturated disk, or a lock hotspot can cap overall throughput no matter how well every other component performs.
The Diagnostic Imperative
Before optimizing, you must diagnose. The table below contrasts this bottleneck-driven approach with the unfocused "shotgun" alternative:
| Aspect | Bottleneck-Driven | Shotgun Approach |
|---|---|---|
| Process | Diagnose → Target → Fix → Verify | Guess → Try multiple fixes → Hope |
| Effort efficiency | High: focused on actual constraint | Low: wastes effort on non-issues |
| Risk | Low: data-driven decisions | High: may introduce regressions |
| Reproducibility | Systematic, teachable | Ad-hoc, personality-dependent |
| Outcome certainty | High confidence in improvement | Uncertain if changes helped |
In most systems, 5-10 queries cause 80-90% of resource consumption. Finding and fixing these "top offenders" yields disproportionate returns. The goal of bottleneck identification is to find these high-impact targets with precision.
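As a quick check of how concentrated a workload is, a minimal sketch (PostgreSQL with the pg_stat_statements extension; column names follow the PostgreSQL 13+ naming used later on this page) computes the share of total execution time attributable to the top 10 statements:

```sql
-- What fraction of total execution time comes from the 10 most expensive statements?
WITH ranked AS (
    SELECT total_exec_time,
           ROW_NUMBER() OVER (ORDER BY total_exec_time DESC) AS rn
    FROM pg_stat_statements
)
SELECT ROUND((100.0 * SUM(total_exec_time) FILTER (WHERE rn <= 10)
              / NULLIF(SUM(total_exec_time), 0))::numeric, 1) AS pct_from_top_10
FROM ranked;
```

If this returns something like 80 or 90, the "top offenders" strategy described above will pay off quickly.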
Effective diagnosis requires distinguishing between what you observe (symptoms) and what causes the problem (root causes). Treating symptoms often provides temporary relief while leaving the underlying issue unresolved—or worse, masking it until catastrophic failure.
Symptom: High CPU usage. Possible root causes: full table scans that an index would eliminate, inefficient query logic or functions applied row by row, excessive plan compilation, or simply more concurrent work than the hardware can absorb.
Symptom: Slow application response. Possible root causes: slow queries, lock contention, connection pool exhaustion, network latency, or time spent in the application tier rather than the database.
The Five Whys Technique
A structured approach to root cause analysis: ask "why?" repeatedly until you reach a cause you can act on. For example:

1. Why is the application slow? → The orders listing query is slow.
2. Why is that query slow? → It scans the entire orders table.
3. Why does it scan the whole table? → No index exists on the columns it filters by (customer_id, order_date).
4. Why is there no index? → The table was small when the query was written, so a scan was acceptable.
5. Why is the scan a problem now? → The dataset has grown far beyond its original size.

Root Cause Identified: Missing index on frequently-filtered columns in a grown dataset. Action: Create an appropriate index on orders(customer_id, order_date)
Seeing high CPU and immediately adding more CPU cores addresses the symptom, not the cause. If the high CPU stems from a full table scan that an index would eliminate, you've now got more cores doing unnecessary work. The query still scans every row—just faster. Root cause analysis prevents this waste.
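As a concrete follow-through on the missing-index root cause above, a minimal PostgreSQL sketch (the index name and filter values are illustrative) creates the index and confirms the planner actually uses it:

```sql
-- Create the missing index without blocking writes, then re-check the plan
CREATE INDEX CONCURRENTLY idx_orders_customer_date
    ON orders (customer_id, order_date);

-- The Seq Scan should now be replaced by an Index Scan / Bitmap Index Scan
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM orders
WHERE customer_id = 12345
  AND order_date >= DATE '2024-01-01';
```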
Wait analysis is the most powerful bottleneck identification technique. The premise is simple: when a query is slow, it's either working (using CPU) or waiting (for something). By examining what queries wait for, we directly identify constraints.
Database systems track wait statistics—accumulating time spent waiting for various resource types. Analyzing these statistics reveals the dominant bottlenecks.
Common wait categories are summarized in the table below; the following query retrieves system-wide wait statistics and buckets them into those categories.
```sql
-- System-wide wait statistics
-- Shows what the database spends most time waiting for
SELECT TOP 20
    wait_type,
    wait_time_ms / 1000.0 AS wait_sec,
    100.0 * wait_time_ms / SUM(wait_time_ms) OVER() AS wait_pct,
    signal_wait_time_ms / 1000.0 AS signal_wait_sec,
    waiting_tasks_count,
    CASE
        WHEN wait_type LIKE 'LCK_%' THEN 'Lock'
        WHEN wait_type LIKE 'PAGEIO%' THEN 'Disk I/O'
        WHEN wait_type LIKE 'ASYNC_NETWORK%' THEN 'Network'
        WHEN wait_type = 'SOS_SCHEDULER_YIELD' THEN 'CPU Pressure'
        WHEN wait_type LIKE 'LATCH%' THEN 'Latch'
        WHEN wait_type LIKE 'CXPACKET%' THEN 'Parallelism'
        WHEN wait_type LIKE 'RESOURCE_SEMAPHORE%' THEN 'Memory Grant'
        ELSE 'Other'
    END AS wait_category
FROM sys.dm_os_wait_stats
WHERE wait_type NOT IN (
    -- Exclude benign/background waits
    'BROKER_TASK_STOP', 'CLR_AUTO_EVENT', 'CLR_MANUAL_EVENT',
    'LAZYWRITER_SLEEP', 'SLEEP_TASK', 'WAITFOR')
  AND wait_time_ms > 0
ORDER BY wait_time_ms DESC;

/* Sample Analysis:
   PAGEIOLATCH_SH       45%  → System is I/O bound (disk reads)
   LCK_M_X              25%  → Significant lock contention
   SOS_SCHEDULER_YIELD  15%  → Some CPU pressure
   ASYNC_NETWORK_IO     10%  → Clients slow to consume results

   Action: Focus on I/O reduction first (45% of waits)
           Then investigate locking issues (25%)
*/

-- Wait analysis for a specific session
SELECT
    session_id,
    wait_type,
    wait_time,
    blocking_session_id,
    resource_description
FROM sys.dm_exec_requests
WHERE session_id = @target_session_id;
```

| Wait Category | Indicates | Investigation Focus | Common Solutions |
|---|---|---|---|
| Page I/O | Disk read bottleneck | Buffer hit ratio, disk latency | Add indexes, increase buffer pool, faster storage |
| Lock | Transaction contention | Blocking queries, transaction scope | Optimize transactions, reduce scope, check for deadlocks |
| CPU Scheduler | CPU saturation | CPU-intensive queries, core count | Optimize queries, add CPU, distribute load |
| Memory Grant | Memory pressure | Large sorts/hashes, concurrent queries | Reduce work_mem, optimize queries, add RAM |
| Network | Slow result consumption | Result size, client performance | Reduce columns, add pagination, check network |
| Latch | Internal contention | Hot pages, tempdb pressure | Reduce tempdb usage, distribute hot objects |
Wait statistics are cumulative since server start (or last reset). For meaningful analysis, either reset stats before/after a workload or use windowed analysis (comparing snapshots). Also consider wait frequency, not just total time—many short waits may indicate different problems than few long waits.
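One way to do the windowed analysis mentioned above is to snapshot the wait statistics before and after the workload of interest and compare the deltas. A minimal T-SQL sketch (the temp table name is illustrative):

```sql
-- Capture a baseline of the cumulative wait statistics
SELECT wait_type, wait_time_ms, waiting_tasks_count
INTO #waits_before
FROM sys.dm_os_wait_stats;

-- ... let the workload of interest run for a representative interval ...

-- Compare current values against the baseline: only the delta matters
SELECT cur.wait_type,
       cur.wait_time_ms        - b.wait_time_ms        AS wait_ms_delta,
       cur.waiting_tasks_count - b.waiting_tasks_count AS waits_delta
FROM sys.dm_os_wait_stats AS cur
JOIN #waits_before AS b ON b.wait_type = cur.wait_type
WHERE cur.wait_time_ms - b.wait_time_ms > 0
ORDER BY wait_ms_delta DESC;
```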
Top-N analysis identifies the queries consuming the most resources. The logic is straightforward: if 10 queries consume 80% of resources, optimizing those 10 provides far more value than optimizing the other 10,000.
Critical Insight: Total vs. Average
A query can appear in top-N lists for different reasons:
The most impactful optimizations target queries that rank high on both dimensions. However, a query executed 1 million times at 5ms each (5,000 total seconds) may be more impactful than one executed 10 times at 100ms each (1 total second).
```sql
-- Top queries by total execution time
-- These contribute most to overall database load
SELECT
    queryid,
    calls,
    ROUND(total_exec_time::numeric / 1000, 2) AS total_sec,
    ROUND(mean_exec_time::numeric, 2) AS avg_ms,
    ROUND(100.0 * total_exec_time / SUM(total_exec_time) OVER(), 2) AS pct_of_total,
    rows AS total_rows_returned,
    shared_blks_hit + shared_blks_read AS total_blocks,
    LEFT(query, 80) AS query_preview
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 20;

-- Multi-dimensional ranking
-- Find queries that are both slow AND frequent
SELECT
    queryid,
    calls,
    mean_exec_time AS avg_ms,
    total_exec_time / 1000 AS total_sec,
    -- Impact score: combines frequency and duration
    ROUND((calls * mean_exec_time / 1000)::numeric, 2) AS impact_score,
    NTILE(10) OVER (ORDER BY calls DESC) AS frequency_decile,
    NTILE(10) OVER (ORDER BY mean_exec_time DESC) AS slowness_decile,
    LEFT(query, 60) AS query_preview
FROM pg_stat_statements
WHERE calls > 100  -- Minimum frequency threshold
ORDER BY calls * mean_exec_time DESC
LIMIT 20;

-- Queries in top decile for BOTH frequency and slowness
-- are highest-priority optimization targets
```

Statistics accumulated over months include long-gone queries and historical patterns. For analysis of current behavior, reset statistics (pg_stat_statements_reset(), sys.sp_query_store_reset_exec_stats) and collect fresh data for a representative period (hours to days depending on workload).
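For example, in PostgreSQL the accumulated statement statistics can be discarded before collecting a fresh window:

```sql
-- Clear pg_stat_statements, then let a representative workload run
-- before re-running the top-N queries above
SELECT pg_stat_statements_reset();
```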
Once top-N analysis identifies problematic queries, execution plan analysis reveals why they're slow. The execution plan is the database's step-by-step recipe for executing a query—analyzing it shows exactly where time and resources are consumed.
Key Execution Plan Elements
The elements to examine are the operators chosen (scans, joins, sorts), estimated versus actual row counts, per-node cost and timing, and buffer/I/O counts.
```sql
-- Estimated plan (no execution)
EXPLAIN SELECT * FROM orders WHERE customer_id = 12345;

-- Actual plan with execution statistics
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT * FROM orders WHERE customer_id = 12345;

/* Example output with bottleneck highlighted:

   Seq Scan on orders  (cost=0.00..235864.00 rows=1250 width=120)
                       (actual time=0.012..1523.456 rows=1234 loops=1)
     Filter: (customer_id = 12345)
     Rows Removed by Filter: 9998766    <-- BOTTLENECK!
     Buffers: shared hit=1234 read=98765
   Planning Time: 0.123 ms
   Execution Time: 1524.567 ms

   Analysis:
   - Seq Scan = reading entire table (no index used)
   - 9.9M rows examined to find 1,234 matches
   - 98,765 pages read from disk
   - Solution: CREATE INDEX idx_orders_customer ON orders(customer_id);
*/

-- After adding index:
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT * FROM orders WHERE customer_id = 12345;

/* Improved output:

   Index Scan using idx_orders_customer on orders
     (actual time=0.034..2.567 rows=1234 loops=1)
     Index Cond: (customer_id = 12345)
     Buffers: shared hit=145
   Execution Time: 2.890 ms    <-- 500x improvement!
*/
```

| Pattern | Symptom | Likely Cause | Solution |
|---|---|---|---|
| Table Scan / Seq Scan | Reading entire table | No usable index | Create index on filter/join columns |
| Estimated ≠ Actual rows (10x+) | Optimizer made wrong choice | Stale statistics | Update statistics, analyze table |
| Sort with spill warning | Sorting to disk | Insufficient memory | Add index for sort, increase work_mem |
| Key Lookup / RID Lookup | Extra I/O per row | Non-covering index | Add INCLUDE columns to index |
| Nested Loop with many outer rows | O(n×m) joins | Missing index on inner table | Add index, consider hash join |
| Parallelism with CXPACKET waits | Uneven work distribution | Skewed data, bad estimate | Update stats, consider serial plan |
When estimated rows differ dramatically from actual rows (10x or more), the optimizer likely chose a suboptimal plan based on incorrect assumptions. This is often the root cause of performance instability—the same query sometimes runs fast (good estimate → good plan) and sometimes slow (bad estimate → bad plan).
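When a large estimate/actual gap points to stale statistics, refreshing them is usually the first step. A minimal sketch (table name taken from the earlier example):

```sql
-- PostgreSQL: recollect planner statistics for the table, then re-check the plan
ANALYZE orders;
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM orders WHERE customer_id = 12345;

-- SQL Server equivalent (schema name is illustrative):
-- UPDATE STATISTICS dbo.orders WITH FULLSCAN;
```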
Lock contention is a frequent bottleneck in concurrent database systems. When multiple transactions compete for the same resources, some must wait. Identifying blocking chains and lock hotspots is essential for high-concurrency workloads.
Lock Hierarchy
Databases use hierarchical locking: locks can be taken at the database, table, page, or row level.
Locks at higher levels (table) block more concurrent access than locks at lower levels (row). Some operations escalate from row to table locks when too many rows are affected.
Blocking Chains
A blocking chain forms when session A holds a lock that session B needs, session B in turn holds a lock that session C needs, and so on.
Session A is the "head blocker." Killing or completing session A releases the entire chain.
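In PostgreSQL, pg_blocking_pids() makes it straightforward to find the head blocker: a session that blocks others but is not itself waiting. A minimal sketch:

```sql
-- Sessions that block at least one other session but are not blocked themselves
SELECT pid,
       state,
       now() - xact_start AS transaction_age,
       LEFT(query, 80)    AS current_query
FROM pg_stat_activity
WHERE cardinality(pg_blocking_pids(pid)) = 0          -- not waiting on anyone
  AND pid IN (SELECT unnest(pg_blocking_pids(pid))    -- but someone waits on it
              FROM pg_stat_activity);

-- Last resort: terminate the head blocker (its open transaction rolls back)
-- SELECT pg_terminate_backend(<head_blocker_pid>);
```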
```sql
-- Current locks and blocking
SELECT
    blocked.pid    AS blocked_pid,
    blocked.query  AS blocked_query,
    blocking.pid   AS blocking_pid,
    blocking.query AS blocking_query,
    blocked.wait_event_type,
    blocked.wait_event
FROM pg_stat_activity blocked
JOIN pg_stat_activity blocking
    ON blocking.pid = ANY(pg_blocking_pids(blocked.pid))
WHERE blocked.wait_event_type = 'Lock';

-- All current locks with details
SELECT
    l.locktype,
    l.relation::regclass AS table_name,
    l.mode,
    l.granted,
    l.pid,
    a.query,
    EXTRACT(EPOCH FROM (now() - a.query_start)) AS duration_sec
FROM pg_locks l
JOIN pg_stat_activity a ON l.pid = a.pid
WHERE l.relation IS NOT NULL
  AND NOT l.granted  -- Show waiting locks
ORDER BY a.query_start;

-- Common lock modes:
--   AccessShareLock:     SELECT
--   RowShareLock:        SELECT FOR UPDATE
--   RowExclusiveLock:    INSERT, UPDATE, DELETE
--   ShareLock:           CREATE INDEX (without CONCURRENTLY)
--   AccessExclusiveLock: ALTER TABLE, DROP TABLE, VACUUM FULL
```

The best lock optimization reduces the scope of what's locked (touching fewer rows) and the duration locks are held, rather than bypassing locking altogether: using NOLOCK or READ UNCOMMITTED sacrifices consistency. Instead, use precise WHERE clauses, batch large updates, minimize transaction scope, and ensure appropriate indexes exist.
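To reduce lock scope and duration as recommended above, large modifications can be broken into small batches. A minimal PostgreSQL sketch (the status values are hypothetical); the application repeats the statement until it affects zero rows:

```sql
-- Each execution touches at most 1,000 rows, so locks are held briefly
-- and other sessions are never stuck behind one huge transaction.
WITH batch AS (
    SELECT id
    FROM orders
    WHERE status = 'expired'
    ORDER BY id
    LIMIT 1000
    FOR UPDATE SKIP LOCKED      -- skip rows another batch is already updating
)
UPDATE orders o
SET    status = 'archived'
FROM   batch
WHERE  o.id = batch.id;
```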
Combining the techniques covered, here's a systematic framework for bottleneck identification that ensures consistent, thorough analysis.
The WAIT→QUERY→PLAN Framework
Step 1: WAIT — What is the system waiting for?
Step 2: QUERY — Which queries contribute most?
Step 3: PLAN — Why are those queries slow?
| If WAIT shows... | QUERY focus | PLAN look for... | Typical Fix |
|---|---|---|---|
| Page I/O waits | High logical/physical reads | Table scans, key lookups | Add indexes, covering indexes |
| Lock waits | Long-running UPDATE/DELETE | Large row counts, table locks | Batch updates, reduce scope |
| CPU waits | High CPU time per execution | Complex expressions, sorts | Simplify logic, add sort indexes |
| Memory waits | High temp disk usage | Sorts, hash spills | Increase work_mem, reduce data |
| Network waits | Large result sets | SELECT *, wide rows | Reduce columns, add pagination |
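The pagination fix in the last row deserves a concrete shape. A minimal keyset-pagination sketch (assuming orders has an indexed id primary key; :last_seen_id is a bind parameter supplied by the application):

```sql
-- Fetch the next page after the last row already seen, instead of using
-- OFFSET; each page becomes a small index range scan.
SELECT id, customer_id, order_date
FROM orders
WHERE id > :last_seen_id
ORDER BY id
LIMIT 50;
```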
```sql
-- Step 1: WAIT ANALYSIS
-- Run wait stats query (see Section 3)
-- Identify top 3 wait categories

-- Step 2: QUERY IDENTIFICATION
-- Run top-N query (see Section 4)
-- Focus on queries matching wait category:
--   I/O waits  → Sort by physical_reads
--   CPU waits  → Sort by cpu_time
--   Lock waits → Look at UPDATE/DELETE statements

-- Step 3: PLAN ANALYSIS
-- For each top query:
EXPLAIN (ANALYZE, BUFFERS) <top_query_here>;

-- Checklist for each plan:
-- [ ] Any Seq Scan / Table Scan on large tables?
-- [ ] Estimated rows within 10x of actual?
-- [ ] Any spill warnings (temp files)?
-- [ ] Any Key Lookup / RID Lookup operations?
-- [ ] Join types appropriate for data volumes?

-- Step 4: VALIDATE HYPOTHESIS
-- Before fixing, predict the improvement:
--   "Adding index X should eliminate the seq scan,
--    reducing I/O from 100,000 pages to ~100 pages,
--    improving execution from 5s to ~50ms"

-- Step 5: IMPLEMENT AND MEASURE
-- Apply fix, re-measure, confirm improvement
-- If improvement < expected, reassess hypothesis
```

After fixing a bottleneck, the system finds a new one. This is expected and healthy. Repeat the WAIT→QUERY→PLAN cycle until performance meets requirements. Each iteration should target the current dominant constraint. If fixing a bottleneck doesn't improve overall performance, reassess—you may have fixed a symptom rather than the root cause.
Bottleneck identification transforms database performance work from guesswork into systematic engineering. The essential concepts:

- Every system has one dominant constraint at a time; improving non-bottleneck components shifts the queue rather than improving throughput.
- Distinguish symptoms (high CPU, slow responses) from root causes (missing indexes, lock contention, stale statistics) before acting.
- Wait analysis shows what the system spends its time waiting on; top-N analysis identifies the queries responsible for most of that time.
- Execution plans reveal why a specific query is slow; lock analysis exposes blocking chains and contention hotspots.
- The WAIT→QUERY→PLAN cycle ties these techniques into a repeatable process: diagnose, target, fix, verify, and repeat against the next constraint.
What's Next
With bottleneck identification mastered, we need the tools to perform this analysis efficiently. The next page explores Measurement Tools—the diagnostic instruments that enable wait analysis, query profiling, execution plan examination, and ongoing performance monitoring.
You now possess a systematic framework for identifying database performance bottlenecks. You can distinguish symptoms from root causes, use wait analysis and top-N queries to locate constraints, read execution plans to understand why queries are slow, and apply the WAIT→QUERY→PLAN methodology. Next, we'll explore the tools that make this analysis efficient.