A mechanic without diagnostic tools can only guess at engine problems. Similarly, a database professional without measurement tools is limited to intuition and trial-and-error. Measurement tools bridge the gap between "something is slow" and "this specific query is slow because it's scanning 10 million rows without an index."
Modern database systems provide rich instrumentation—but this wealth of options can be overwhelming. Understanding which tools to use for which problems, and how to interpret their output, separates effective performance engineers from those who collect data without actionable insight.
By the end of this page, you will understand the major categories of performance measurement tools, know when to use each tool type, be able to interpret the key outputs, and understand how to build a performance monitoring strategy using these instruments.
Database performance tools fall into distinct categories, each serving different diagnostic needs. Understanding these categories helps select the right tool for each situation.
Interactive Query Analysis
Tools used during active investigation to understand individual query behavior: EXPLAIN and EXPLAIN ANALYZE, which expose execution plans and actual runtime statistics.
Statistical Aggregation
Tools that collect and aggregate performance data over time: pg_stat_statements (PostgreSQL), Query Store (SQL Server), and the Performance Schema with sys schema (MySQL).
Real-Time Monitoring
Tools showing current database state: dynamic views such as pg_stat_activity that expose active sessions, waits, and locks.
Logging and Profiling
Passive collection of query and performance data: slow query logs and auto_explain plan capture.
| Category | Use Case | Key Strength | Limitation |
|---|---|---|---|
| Interactive Analysis | Debugging specific slow queries | Detailed per-query insight | Manual, doesn't scale to many queries |
| Statistical Aggregation | Finding top resource consumers | Long-term trends, prioritization | Loses individual query detail |
| Real-Time Monitoring | Active incident investigation | Shows current state | Misses intermittent issues |
| Logging/Profiling | Historical analysis, pattern detection | Captures everything | Storage overhead, noise |
Effective performance monitoring uses tools from multiple categories. Statistical aggregation identifies WHICH queries to investigate. Interactive analysis reveals WHY they're slow. Real-time monitoring catches issues AS they happen. Logging provides HISTORICAL context. Each layer adds insight.
The EXPLAIN command reveals how the database intends to execute a query—the execution plan. This is the most fundamental and frequently used performance diagnostic tool. Every database professional must understand execution plan analysis.
EXPLAIN Variants
```sql
-- Basic EXPLAIN: estimated plan only
EXPLAIN SELECT * FROM orders WHERE customer_id = 100;

-- ANALYZE: actually execute and show timing
EXPLAIN ANALYZE SELECT * FROM orders WHERE customer_id = 100;

-- BUFFERS: show I/O statistics
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM orders WHERE customer_id = 100;

-- TIMING: explicit timing (on by default with ANALYZE)
EXPLAIN (ANALYZE, TIMING) SELECT * FROM orders WHERE customer_id = 100;

-- FORMAT options: TEXT (default), XML, JSON, YAML
EXPLAIN (ANALYZE, FORMAT JSON) SELECT * FROM orders WHERE customer_id = 100;

-- Full diagnostic mode
EXPLAIN (ANALYZE, VERBOSE, BUFFERS, FORMAT TEXT)
SELECT o.*, c.name
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
WHERE o.order_date > '2024-01-01';

/* Output includes:
   - Execution tree with actual times
   - Rows estimated vs actual
   - Buffer hits vs reads (I/O)
   - Output columns per node
   - Planning time + Execution time
*/
```

EXPLAIN ANALYZE actually runs the query, including data modifications. Never use EXPLAIN ANALYZE on UPDATE, DELETE, or INSERT statements unless you intend to execute them. If you need actual execution metrics on DML, wrap the statement in a transaction and roll it back, as shown below.
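To make the rollback pattern concrete, here is a minimal sketch of capturing real execution metrics for an UPDATE without persisting its changes. The `orders` table matches the examples above; the `status` column and its value are hypothetical.

```sql
-- Measure the actual cost of a DML statement, then undo its effects.
BEGIN;

EXPLAIN (ANALYZE, BUFFERS)
UPDATE orders
SET status = 'archived'          -- hypothetical column/value for illustration
WHERE order_date < '2023-01-01';

-- The UPDATE genuinely executed inside this transaction; discard it.
ROLLBACK;
```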
Query statistics extensions aggregate performance data across all query executions, enabling identification of high-impact queries without real-time monitoring. These are essential for systematic performance management.
PostgreSQL: pg_stat_statements
The most important PostgreSQL performance extension. Aggregates execution statistics by normalized query signature.
SQL Server: Query Store
Built-in feature (SQL Server 2016+) that captures query plans and runtime statistics. Provides plan forcing, regression detection, and historical analysis.
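As a hedged sketch of how Query Store data is commonly queried, the following joins its catalog views to surface the slowest statements by average duration. Query Store reports durations in microseconds, hence the division.

```sql
-- Top 10 statements by average duration (SQL Server 2016+, Query Store enabled)
SELECT TOP 10
    qt.query_sql_text,
    rs.count_executions,
    rs.avg_duration / 1000.0 AS avg_ms,   -- Query Store stores microseconds
    rs.avg_logical_io_reads
FROM sys.query_store_runtime_stats AS rs
JOIN sys.query_store_plan AS p  ON p.plan_id = rs.plan_id
JOIN sys.query_store_query AS q ON q.query_id = p.query_id
JOIN sys.query_store_query_text AS qt ON qt.query_text_id = q.query_text_id
ORDER BY rs.avg_duration DESC;
```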
MySQL: Performance Schema + sys schema
Performance Schema provides detailed instrumentation; sys schema offers user-friendly views over Performance Schema data.
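For example, the sys schema's statement_analysis view plays a role similar to pg_stat_statements. This sketch assumes the Performance Schema is enabled (the default in modern MySQL versions).

```sql
-- Top normalized statements by total latency (the view is pre-sorted by it)
SELECT query,               -- normalized statement digest
       exec_count,
       total_latency,
       avg_latency,
       rows_examined_avg,
       full_scan            -- flags statements that do full table scans
FROM sys.statement_analysis
LIMIT 10;
```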
In PostgreSQL, the pg_stat_statements workflow looks like this:

```sql
-- Enable pg_stat_statements (requires server restart initially)
-- postgresql.conf:
--   shared_preload_libraries = 'pg_stat_statements'
--   pg_stat_statements.max = 10000
--   pg_stat_statements.track = all

CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

-- Top queries by total time
SELECT
    queryid,
    calls,
    ROUND(total_exec_time::numeric, 2) AS total_ms,
    ROUND(mean_exec_time::numeric, 2) AS avg_ms,
    ROUND(stddev_exec_time::numeric, 2) AS stddev_ms,
    rows,
    shared_blks_hit,
    shared_blks_read,
    ROUND(100.0 * shared_blks_hit /
          NULLIF(shared_blks_hit + shared_blks_read, 0), 1) AS hit_pct,
    LEFT(query, 80) AS query_text
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 20;

-- Most variable queries (inconsistent performance)
SELECT
    queryid,
    calls,
    mean_exec_time,
    stddev_exec_time,
    stddev_exec_time / NULLIF(mean_exec_time, 0) AS coef_of_variation,
    LEFT(query, 80)
FROM pg_stat_statements
WHERE calls > 100
ORDER BY stddev_exec_time / NULLIF(mean_exec_time, 0) DESC
LIMIT 10;

-- Reset statistics (do periodically for fresh baseline)
SELECT pg_stat_statements_reset();
```

Statistics extensions normalize queries by replacing literal values with placeholders. "SELECT * FROM t WHERE id = 1" and "SELECT * FROM t WHERE id = 2" become one aggregated entry: "SELECT * FROM t WHERE id = ?". This enables meaningful aggregation across thousands of individual executions with different parameters.
Wait statistics reveal what database sessions spend time waiting for. This directly identifies resource bottlenecks without requiring you to examine every query. Modern databases provide extensive wait event instrumentation.
Wait Event Architecture
When a session cannot proceed, it enters a "wait" state with a specific wait type describing what it's waiting for: a transaction lock, a lightweight internal lock, disk I/O, inter-process communication, a slow client, and so on.
Aggregating these waits reveals system-wide patterns.
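The per-session snapshot below lists individual waits; to see the system-wide pattern, one simple approach (a sketch, not a substitute for a sampling extension) is to group and count the wait events, polling repeatedly to approximate a wait profile:

```sql
-- Which wait events are active sessions currently in?
-- Run this on an interval (e.g., every second) and sum the counts over time.
SELECT wait_event_type,
       wait_event,
       COUNT(*) AS sessions_waiting
FROM pg_stat_activity
WHERE wait_event IS NOT NULL
  AND state = 'active'
GROUP BY wait_event_type, wait_event
ORDER BY sessions_waiting DESC;
```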
```sql
-- Current wait events (real-time snapshot)
SELECT
    pid,
    wait_event_type,
    wait_event,
    state,
    EXTRACT(EPOCH FROM (now() - query_start)) AS running_sec,
    LEFT(query, 50) AS query
FROM pg_stat_activity
WHERE wait_event IS NOT NULL
  AND state = 'active'
ORDER BY query_start;

/* Wait event types:
   Activity  - Server process is idle
   Client    - Waiting for client
   Extension - Extension-specific wait
   IO        - Disk I/O wait
   IPC       - Inter-process communication
   Lock      - Transaction lock
   LWLock    - Lightweight lock
   Timeout   - Timeout wait
*/

-- I/O timing breakdown per table (requires track_io_timing)
SET track_io_timing = on;

SELECT
    relname,
    heap_blks_read,
    heap_blks_hit,
    idx_blks_read,
    idx_blks_hit,
    ROUND(100.0 * heap_blks_hit /
          NULLIF(heap_blks_hit + heap_blks_read, 0), 1) AS heap_hit_pct
FROM pg_statio_user_tables
ORDER BY heap_blks_read + idx_blks_read DESC
LIMIT 10;

-- For comprehensive wait analysis, install pg_wait_sampling
-- CREATE EXTENSION pg_wait_sampling;
```

Common SQL Server wait types and the typical responses:

| Wait Type | Category | Indicates | Response |
|---|---|---|---|
| PAGEIOLATCH_SH/EX | I/O | Disk reads for data pages | Add indexes, increase buffer pool |
| LCK_M_* | Lock | Transaction lock contention | Optimize transactions, check blocking |
| CXPACKET | Parallelism | Parallel thread synchronization | Check for skewed parallelism |
| ASYNC_NETWORK_IO | Network | Client slow to consume results | Reduce result size, check client |
| SOS_SCHEDULER_YIELD | CPU | CPU scheduling contention | Optimize CPU-heavy queries |
| WRITELOG | I/O | Transaction log writes | Check log disk performance |
Total wait = Signal wait + Resource wait. Signal wait is time waiting for CPU after resource became available. High signal wait (relative to total) indicates CPU pressure—the process got the resource but had to wait for CPU to continue. High resource wait indicates the resource itself (disk, lock) is the constraint.
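In SQL Server, for example, that split can be read from sys.dm_os_wait_stats; this sketch computes the signal-wait share of total wait time since the counters were last cleared:

```sql
-- Share of wait time spent waiting for CPU (signal) vs. the resource itself.
SELECT
    SUM(signal_wait_time_ms) AS signal_wait_ms,
    SUM(wait_time_ms - signal_wait_time_ms) AS resource_wait_ms,
    CAST(100.0 * SUM(signal_wait_time_ms)
         / NULLIF(SUM(wait_time_ms), 0) AS DECIMAL(5, 2)) AS signal_wait_pct
FROM sys.dm_os_wait_stats;
```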
Real-time monitoring views show the current state of the database—active queries, resource consumption, lock status. These are essential during active incidents and for understanding moment-to-moment behavior.
Key Monitoring Targets
The queries below cover the usual targets: active sessions, long-running queries, blocking chains, and database-wide activity counters.
```sql
-- All active queries
SELECT
    pid,
    usename,
    application_name,
    client_addr,
    state,
    EXTRACT(EPOCH FROM (now() - query_start)) AS running_sec,
    wait_event_type,
    wait_event,
    LEFT(query, 60) AS query
FROM pg_stat_activity
WHERE pid != pg_backend_pid()
ORDER BY query_start NULLS LAST;

-- Long-running queries (> 60 seconds)
SELECT
    pid,
    usename,
    EXTRACT(EPOCH FROM (now() - query_start)) AS running_sec,
    state,
    query
FROM pg_stat_activity
WHERE state = 'active'
  AND now() - query_start > INTERVAL '60 seconds'
  AND pid != pg_backend_pid();

-- Current blocking chains
SELECT
    blocked.pid AS blocked_pid,
    blocked.usename AS blocked_user,
    blocking.pid AS blocking_pid,
    blocking.usename AS blocking_user,
    blocked.query AS blocked_query,
    blocking.query AS blocking_query
FROM pg_stat_activity blocked
JOIN pg_stat_activity blocking
    ON blocking.pid = ANY(pg_blocking_pids(blocked.pid));

-- Database statistics overview
SELECT
    datname,
    numbackends AS connections,
    xact_commit AS commits,
    xact_rollback AS rollbacks,
    blks_read,
    blks_hit,
    tup_returned,
    tup_fetched,
    tup_inserted,
    tup_updated,
    tup_deleted
FROM pg_stat_database
WHERE datname = current_database();
```

Real-time views show a point-in-time snapshot. A query that takes 0.5 seconds might not be captured if you check every second. For catching short-duration problems, combine real-time monitoring with statistical aggregation (pg_stat_statements, Query Store) and logging (slow query log).
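When these views surface a runaway session during an incident, a common follow-up (sketched here with an illustrative pid taken from the queries above) is to cancel the query, escalating to terminating the backend only if cancellation is ignored:

```sql
-- Politely cancel the current query on backend 12345 (illustrative pid).
SELECT pg_cancel_backend(12345);

-- Last resort: terminate the backend entirely (drops the connection).
SELECT pg_terminate_backend(12345);
```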
Slow query logs capture queries exceeding a configured duration threshold. This passive approach catches all problematic queries without constant monitoring, making it ideal for production environments.
Configuration Considerations
The central tradeoff is the duration threshold: set it too low and the log drowns in noise (with measurable overhead), too high and real problems slip through unlogged.
```ini
# postgresql.conf settings

# Log queries taking more than 100ms
log_min_duration_statement = 100

# Or log all statements (verbose)
log_statement = 'all'

# Log additional useful information
log_line_prefix = '%t [%p]: user=%u,db=%d,app=%a,client=%h '
log_duration = on

# Auto-explain for slow queries (includes plans!)
shared_preload_libraries = 'auto_explain'
auto_explain.log_min_duration = 500   # Log plans for queries > 500ms
auto_explain.log_analyze = on         # Include actual execution stats
auto_explain.log_buffers = on         # Include buffer usage
auto_explain.log_timing = on

# Sample output in PostgreSQL log:
#   2024-01-15 14:32:15.123 [12345]: user=app,db=prod...
#   LOG:  duration: 1523.456 ms  statement: SELECT * FROM orders...
#   LOG:  Query plan:
#     Seq Scan on orders (...actual time=0.012..1523.456 rows=1234 loops=1)
#       Filter: (customer_id = 100)
#       Rows Removed by Filter: 9998766
#     Planning Time: 0.123 ms
#     Execution Time: 1523.567 ms
```

Consider logging queries that don't use indexes (MySQL: log_queries_not_using_indexes). This catches inefficient queries regardless of execution time—they may be fast only because the table is small. When the table grows, they'll become slow.
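For comparison, a hedged my.cnf sketch enabling the MySQL slow query log (including that option) might look like this; the option names are standard server variables, and the thresholds are illustrative:

```ini
# my.cnf - slow query log settings
[mysqld]
slow_query_log                = 1
slow_query_log_file           = /var/log/mysql/slow.log
long_query_time               = 0.1     # seconds; fractional values allowed
log_queries_not_using_indexes = 1       # also log index-less queries
```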
While built-in tools are powerful, third-party monitoring solutions provide enhanced visualization, alerting, historical analysis, and correlation across multiple databases. These are especially valuable for larger environments.
Tool Categories
| Tool | Category | Strengths | Considerations |
|---|---|---|---|
| Datadog | General APM | Unified dashboard, AI-assisted analysis, cloud integration | Cost scales with data volume |
| New Relic | General APM | Deep transaction tracing, broad language support | Learning curve, pricing |
| SolarWinds DPA | Database-Specific | Excellent wait analysis, VMware/SQL Server focus | On-premises, licensing |
| pganalyze | PostgreSQL | Query insights, index recommendations, vacuum analysis | PostgreSQL-only |
| Percona Monitoring (PMM) | Open Source | Free, MySQL/PostgreSQL/MongoDB, comprehensive | Self-hosted maintenance |
| Prometheus + Grafana | Open Source | Highly customizable, extensible, cloud-native | Requires setup expertise |
| AWS RDS Performance Insights | Cloud Native | Integrated with AWS, low setup | AWS databases only |
| Azure SQL Analytics | Cloud Native | Azure integration, AI recommendations | Azure SQL only |
Building a Monitoring Stack
A comprehensive monitoring strategy typically includes:

- Metrics collection and time-series storage (e.g., Prometheus scraping database exporters)
- Dashboards for visualization (e.g., Grafana)
- Alerting rules tied to the thresholds you care about
- Query-level statistics from the database itself (pg_stat_statements, Query Store)

The configuration below sketches the collection and alerting layers of such a stack.
```yaml
# prometheus.yml - Scrape PostgreSQL exporter
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'postgresql'
    static_configs:
      - targets: ['postgres-exporter:9187']

  - job_name: 'mysql'
    static_configs:
      - targets: ['mysqld-exporter:9104']

# Alerting rules example
# alerts.yml
groups:
  - name: database_alerts
    rules:
      - alert: HighQueryDuration
        expr: pg_stat_statements_mean_time_seconds > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Query average time exceeds 1 second"

      - alert: LowCacheHitRatio
        expr: pg_stat_database_blks_hit / (pg_stat_database_blks_hit + pg_stat_database_blks_read) < 0.95
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Buffer cache hit ratio below 95%"

      - alert: ConnectionsSaturated
        expr: pg_stat_activity_count / pg_settings_max_connections > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Connection usage above 80%"
```

When selecting monitoring tools, consider: deployment model (SaaS vs. self-hosted), database platform support, integration with existing infrastructure, total cost at your scale, team expertise, and vendor lock-in implications. Sometimes the best tool is the one your team will actually use.
Effective performance diagnosis requires mastery of measurement tools across multiple categories. To consolidate the essential concepts:

- EXPLAIN and EXPLAIN ANALYZE reveal how an individual query executes and where its time goes.
- Statistics extensions (pg_stat_statements, Query Store, Performance Schema) aggregate executions to surface the highest-impact queries.
- Wait statistics identify what sessions are blocked on, pointing directly at resource bottlenecks.
- Real-time views cover active incidents; slow query logs and auto_explain capture problems you weren't watching for.
- Third-party and cloud-native platforms add visualization, alerting, and correlation across databases.
What's Next
With measurement tools in hand, we need to know what we're measuring toward. The final page in this module explores Performance Goals—defining targets, establishing SLAs, setting baselines, and creating the framework for continuous performance improvement.
You now have a comprehensive toolkit for database performance measurement. You understand when to use EXPLAIN, how to leverage query statistics extensions, how to analyze wait events, and when to employ logging versus real-time monitoring versus third-party solutions. Next, we'll learn to define and pursue meaningful performance goals.