A mechanic without diagnostic tools can only guess at engine problems. Similarly, a database professional without measurement tools is limited to intuition and trial-and-error. Measurement tools bridge the gap between "something is slow" and "this specific query is slow because it's scanning 10 million rows without an index."
Modern database systems provide rich instrumentation—but this wealth of options can be overwhelming. Understanding which tools to use for which problems, and how to interpret their output, separates effective performance engineers from those who collect data without actionable insight.
By the end of this page, you will understand the major categories of performance measurement tools, know when to use each tool type, be able to interpret the key outputs, and understand how to build a performance monitoring strategy using these instruments.
Database performance tools fall into distinct categories, each serving different diagnostic needs. Understanding these categories helps select the right tool for each situation.
Interactive Query Analysis
Tools used during active investigation to understand individual query behavior: EXPLAIN and EXPLAIN ANALYZE, which expose execution plans and actual runtime statistics.
Statistical Aggregation
Tools that collect and aggregate performance data over time: pg_stat_statements (PostgreSQL), Query Store (SQL Server), and the Performance Schema with sys schema (MySQL).
Real-Time Monitoring
Tools showing current database state: dynamic views such as pg_stat_activity that expose active sessions, waits, and locks.
Logging and Profiling
Passive collection of query and performance data: slow query logs and auto_explain plan capture.
| Category | Use Case | Key Strength | Limitation |
|---|---|---|---|
| Interactive Analysis | Debugging specific slow queries | Detailed per-query insight | Manual, doesn't scale to many queries |
| Statistical Aggregation | Finding top resource consumers | Long-term trends, prioritization | Loses individual query detail |
| Real-Time Monitoring | Active incident investigation | Shows current state | Misses intermittent issues |
| Logging/Profiling | Historical analysis, pattern detection | Captures everything | Storage overhead, noise |
Effective performance monitoring uses tools from multiple categories. Statistical aggregation identifies WHICH queries to investigate. Interactive analysis reveals WHY they're slow. Real-time monitoring catches issues AS they happen. Logging provides HISTORICAL context. Each layer adds insight.
The EXPLAIN command reveals how the database intends to execute a query—the execution plan. This is the most fundamental and frequently used performance diagnostic tool. Every database professional must understand execution plan analysis.
EXPLAIN Variants
```sql
-- Basic EXPLAIN: estimated plan only
EXPLAIN SELECT * FROM orders WHERE customer_id = 100;

-- ANALYZE: actually execute and show timing
EXPLAIN ANALYZE SELECT * FROM orders WHERE customer_id = 100;

-- BUFFERS: show I/O statistics
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM orders WHERE customer_id = 100;

-- TIMING: explicit timing (on by default with ANALYZE)
EXPLAIN (ANALYZE, TIMING) SELECT * FROM orders WHERE customer_id = 100;

-- FORMAT options: TEXT (default), XML, JSON, YAML
EXPLAIN (ANALYZE, FORMAT JSON) SELECT * FROM orders WHERE customer_id = 100;

-- Full diagnostic mode
EXPLAIN (ANALYZE, VERBOSE, BUFFERS, FORMAT TEXT)
SELECT o.*, c.name
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
WHERE o.order_date > '2024-01-01';

/* Output includes:
   - Execution tree with actual times
   - Rows estimated vs actual
   - Buffer hits vs reads (I/O)
   - Output columns per node
   - Planning time + Execution time
*/
```

EXPLAIN ANALYZE actually runs the query, including data modifications. Never use EXPLAIN ANALYZE on UPDATE, DELETE, or INSERT statements unless you intend to execute them. If you need actual execution metrics on DML, wrap the statement in a transaction and roll it back, as shown below.
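To make the rollback pattern concrete, here is a minimal sketch of capturing real execution metrics for an UPDATE without persisting its changes. The `orders` table matches the examples above; the `status` column and its value are hypothetical.

```sql
-- Measure the actual cost of a DML statement, then undo its effects.
BEGIN;

EXPLAIN (ANALYZE, BUFFERS)
UPDATE orders
SET status = 'archived'          -- hypothetical column/value for illustration
WHERE order_date < '2023-01-01';

-- The UPDATE genuinely executed inside this transaction; discard it.
ROLLBACK;
```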
Query statistics extensions aggregate performance data across all query executions, enabling identification of high-impact queries without real-time monitoring. These are essential for systematic performance management.
PostgreSQL: pg_stat_statements
The most important PostgreSQL performance extension. Aggregates execution statistics by normalized query signature.
SQL Server: Query Store
Built-in feature (SQL Server 2016+) that captures query plans and runtime statistics. Provides plan forcing, regression detection, and historical analysis.
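As a hedged sketch of how Query Store data is commonly queried, the following joins its catalog views to surface the slowest statements by average duration. Query Store reports durations in microseconds, hence the division.

```sql
-- Top 10 statements by average duration (SQL Server 2016+, Query Store enabled)
SELECT TOP 10
    qt.query_sql_text,
    rs.count_executions,
    rs.avg_duration / 1000.0 AS avg_ms,   -- Query Store stores microseconds
    rs.avg_logical_io_reads
FROM sys.query_store_runtime_stats AS rs
JOIN sys.query_store_plan AS p  ON p.plan_id = rs.plan_id
JOIN sys.query_store_query AS q ON q.query_id = p.query_id
JOIN sys.query_store_query_text AS qt ON qt.query_text_id = q.query_text_id
ORDER BY rs.avg_duration DESC;
```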
MySQL: Performance Schema + sys schema
Performance Schema provides detailed instrumentation; sys schema offers user-friendly views over Performance Schema data.
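For example, the sys schema's statement_analysis view plays a role similar to pg_stat_statements. This sketch assumes the Performance Schema is enabled (the default in modern MySQL versions).

```sql
-- Top normalized statements by total latency (the view is pre-sorted by it)
SELECT query,               -- normalized statement digest
       exec_count,
       total_latency,
       avg_latency,
       rows_examined_avg,
       full_scan            -- flags statements that do full table scans
FROM sys.statement_analysis
LIMIT 10;
```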
In PostgreSQL, the pg_stat_statements workflow looks like this:

```sql
-- Enable pg_stat_statements (requires server restart initially)
-- postgresql.conf:
--   shared_preload_libraries = 'pg_stat_statements'
--   pg_stat_statements.max = 10000
--   pg_stat_statements.track = all

CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

-- Top queries by total time
SELECT
    queryid,
    calls,
    ROUND(total_exec_time::numeric, 2) AS total_ms,
    ROUND(mean_exec_time::numeric, 2) AS avg_ms,
    ROUND(stddev_exec_time::numeric, 2) AS stddev_ms,
    rows,
    shared_blks_hit,
    shared_blks_read,
    ROUND(100.0 * shared_blks_hit /
          NULLIF(shared_blks_hit + shared_blks_read, 0), 1) AS hit_pct,
    LEFT(query, 80) AS query_text
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 20;

-- Most variable queries (inconsistent performance)
SELECT
    queryid,
    calls,
    mean_exec_time,
    stddev_exec_time,
    stddev_exec_time / NULLIF(mean_exec_time, 0) AS coef_of_variation,
    LEFT(query, 80)
FROM pg_stat_statements
WHERE calls > 100
ORDER BY stddev_exec_time / NULLIF(mean_exec_time, 0) DESC
LIMIT 10;

-- Reset statistics (do periodically for fresh baseline)
SELECT pg_stat_statements_reset();
```

Statistics extensions normalize queries by replacing literal values with placeholders. "SELECT * FROM t WHERE id = 1" and "SELECT * FROM t WHERE id = 2" become one aggregated entry: "SELECT * FROM t WHERE id = ?". This enables meaningful aggregation across thousands of individual executions with different parameters.
Wait statistics reveal what database sessions spend time waiting for. This directly identifies resource bottlenecks without requiring you to examine every query. Modern databases provide extensive wait event instrumentation.
Wait Event Architecture
When a session cannot proceed, it enters a "wait" state with a specific wait type describing what it's waiting for: a transaction lock, a lightweight internal lock, disk I/O, inter-process communication, a slow client, and so on.
Aggregating these waits reveals system-wide patterns.
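The per-session snapshot below lists individual waits; to see the system-wide pattern, one simple approach (a sketch, not a substitute for a sampling extension) is to group and count the wait events, polling repeatedly to approximate a wait profile:

```sql
-- Which wait events are active sessions currently in?
-- Run this on an interval (e.g., every second) and sum the counts over time.
SELECT wait_event_type,
       wait_event,
       COUNT(*) AS sessions_waiting
FROM pg_stat_activity
WHERE wait_event IS NOT NULL
  AND state = 'active'
GROUP BY wait_event_type, wait_event
ORDER BY sessions_waiting DESC;
```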
```sql
-- Current wait events (real-time snapshot)
SELECT
    pid,
    wait_event_type,
    wait_event,
    state,
    EXTRACT(EPOCH FROM (now() - query_start)) AS running_sec,
    LEFT(query, 50) AS query
FROM pg_stat_activity
WHERE wait_event IS NOT NULL
  AND state = 'active'
ORDER BY query_start;

/* Wait event types:
   Activity  - Server process is idle
   Client    - Waiting for client
   Extension - Extension-specific wait
   IO        - Disk I/O wait
   IPC       - Inter-process communication
   Lock      - Transaction lock
   LWLock    - Lightweight lock
   Timeout   - Timeout wait
*/

-- I/O timing breakdown per table (requires track_io_timing)
SET track_io_timing = on;

SELECT
    relname,
    heap_blks_read,
    heap_blks_hit,
    idx_blks_read,
    idx_blks_hit,
    ROUND(100.0 * heap_blks_hit /
          NULLIF(heap_blks_hit + heap_blks_read, 0), 1) AS heap_hit_pct
FROM pg_statio_user_tables
ORDER BY heap_blks_read + idx_blks_read DESC
LIMIT 10;

-- For comprehensive wait analysis, install pg_wait_sampling
-- CREATE EXTENSION pg_wait_sampling;
```

Common SQL Server wait types and the typical responses:

| Wait Type | Category | Indicates | Response |
|---|---|---|---|
| PAGEIOLATCH_SH/EX | I/O | Disk reads for data pages | Add indexes, increase buffer pool |
| LCK_M_* | Lock | Transaction lock contention | Optimize transactions, check blocking |
| CXPACKET | Parallelism | Parallel thread synchronization | Check for skewed parallelism |
| ASYNC_NETWORK_IO | Network | Client slow to consume results | Reduce result size, check client |
| SOS_SCHEDULER_YIELD | CPU | CPU scheduling contention | Optimize CPU-heavy queries |
| WRITELOG | I/O | Transaction log writes | Check log disk performance |
Total wait = Signal wait + Resource wait. Signal wait is time waiting for CPU after resource became available. High signal wait (relative to total) indicates CPU pressure—the process got the resource but had to wait for CPU to continue. High resource wait indicates the resource itself (disk, lock) is the constraint.
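In SQL Server, for example, that split can be read from sys.dm_os_wait_stats; this sketch computes the signal-wait share of total wait time since the counters were last cleared:

```sql
-- Share of wait time spent waiting for CPU (signal) vs. the resource itself.
SELECT
    SUM(signal_wait_time_ms) AS signal_wait_ms,
    SUM(wait_time_ms - signal_wait_time_ms) AS resource_wait_ms,
    CAST(100.0 * SUM(signal_wait_time_ms)
         / NULLIF(SUM(wait_time_ms), 0) AS DECIMAL(5, 2)) AS signal_wait_pct
FROM sys.dm_os_wait_stats;
```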
Real-time monitoring views show the current state of the database—active queries, resource consumption, lock status. These are essential during active incidents and for understanding moment-to-moment behavior.
Key Monitoring Targets
The queries below cover the usual targets: active sessions, long-running queries, blocking chains, and database-wide activity counters.
```sql
-- All active queries
SELECT
    pid,
    usename,
    application_name,
    client_addr,
    state,
    EXTRACT(EPOCH FROM (now() - query_start)) AS running_sec,
    wait_event_type,
    wait_event,
    LEFT(query, 60) AS query
FROM pg_stat_activity
WHERE pid != pg_backend_pid()
ORDER BY query_start NULLS LAST;

-- Long-running queries (> 60 seconds)
SELECT
    pid,
    usename,
    EXTRACT(EPOCH FROM (now() - query_start)) AS running_sec,
    state,
    query
FROM pg_stat_activity
WHERE state = 'active'
  AND now() - query_start > INTERVAL '60 seconds'
  AND pid != pg_backend_pid();

-- Current blocking chains
SELECT
    blocked.pid AS blocked_pid,
    blocked.usename AS blocked_user,
    blocking.pid AS blocking_pid,
    blocking.usename AS blocking_user,
    blocked.query AS blocked_query,
    blocking.query AS blocking_query
FROM pg_stat_activity blocked
JOIN pg_stat_activity blocking
    ON blocking.pid = ANY(pg_blocking_pids(blocked.pid));

-- Database statistics overview
SELECT
    datname,
    numbackends AS connections,
    xact_commit AS commits,
    xact_rollback AS rollbacks,
    blks_read,
    blks_hit,
    tup_returned,
    tup_fetched,
    tup_inserted,
    tup_updated,
    tup_deleted
FROM pg_stat_database
WHERE datname = current_database();
```

Real-time views show a point-in-time snapshot. A query that takes 0.5 seconds might not be captured if you check every second. For catching short-duration problems, combine real-time monitoring with statistical aggregation (pg_stat_statements, Query Store) and logging (slow query log).
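When these views surface a runaway session during an incident, a common follow-up (sketched here with an illustrative pid taken from the queries above) is to cancel the query, escalating to terminating the backend only if cancellation is ignored:

```sql
-- Politely cancel the current query on backend 12345 (illustrative pid).
SELECT pg_cancel_backend(12345);

-- Last resort: terminate the backend entirely (drops the connection).
SELECT pg_terminate_backend(12345);
```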
Slow query logs capture queries exceeding a configured duration threshold. This passive approach catches all problematic queries without constant monitoring, making it ideal for production environments.
Configuration Considerations
The central tradeoff is the duration threshold: set it too low and the log drowns in noise (with measurable overhead), too high and real problems slip through unlogged.
```ini
# postgresql.conf settings

# Log queries taking more than 100ms
log_min_duration_statement = 100

# Or log all statements (verbose)
log_statement = 'all'

# Log additional useful information
log_line_prefix = '%t [%p]: user=%u,db=%d,app=%a,client=%h '
log_duration = on

# Auto-explain for slow queries (includes plans!)
shared_preload_libraries = 'auto_explain'
auto_explain.log_min_duration = 500   # Log plans for queries > 500ms
auto_explain.log_analyze = on         # Include actual execution stats
auto_explain.log_buffers = on         # Include buffer usage
auto_explain.log_timing = on

# Sample output in PostgreSQL log:
#   2024-01-15 14:32:15.123 [12345]: user=app,db=prod...
#   LOG:  duration: 1523.456 ms  statement: SELECT * FROM orders...
#   LOG:  Query plan:
#     Seq Scan on orders (...actual time=0.012..1523.456 rows=1234 loops=1)
#       Filter: (customer_id = 100)
#       Rows Removed by Filter: 9998766
#     Planning Time: 0.123 ms
#     Execution Time: 1523.567 ms
```

Consider logging queries that don't use indexes (MySQL: log_queries_not_using_indexes). This catches inefficient queries regardless of execution time—they may be fast only because the table is small. When the table grows, they'll become slow.
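For comparison, a hedged my.cnf sketch enabling the MySQL slow query log (including that option) might look like this; the option names are standard server variables, and the thresholds are illustrative:

```ini
# my.cnf - slow query log settings
[mysqld]
slow_query_log                = 1
slow_query_log_file           = /var/log/mysql/slow.log
long_query_time               = 0.1     # seconds; fractional values allowed
log_queries_not_using_indexes = 1       # also log index-less queries
```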
While built-in tools are powerful, third-party monitoring solutions provide enhanced visualization, alerting, historical analysis, and correlation across multiple databases. These are especially valuable for larger environments.
Tool Categories
| Tool | Category | Strengths | Considerations |
|---|---|---|---|
| Datadog | General APM | Unified dashboard, AI-assisted analysis, cloud integration | Cost scales with data volume |
| New Relic | General APM | Deep transaction tracing, broad language support | Learning curve, pricing |
| SolarWinds DPA | Database-Specific | Excellent wait analysis, VMware/SQL Server focus | On-premises, licensing |
| pganalyze | PostgreSQL | Query insights, index recommendations, vacuum analysis | PostgreSQL-only |
| Percona Monitoring (PMM) | Open Source | Free, MySQL/PostgreSQL/MongoDB, comprehensive | Self-hosted maintenance |
| Prometheus + Grafana | Open Source | Highly customizable, extensible, cloud-native | Requires setup expertise |
| AWS RDS Performance Insights | Cloud Native | Integrated with AWS, low setup | AWS databases only |
| Azure SQL Analytics | Cloud Native | Azure integration, AI recommendations | Azure SQL only |
Building a Monitoring Stack
A comprehensive monitoring strategy typically includes:

- Metrics collection and time-series storage (e.g., Prometheus scraping database exporters)
- Dashboards for visualization (e.g., Grafana)
- Alerting rules tied to the thresholds you care about
- Query-level statistics from the database itself (pg_stat_statements, Query Store)

The configuration below sketches the collection and alerting layers of such a stack.
```yaml
# prometheus.yml - Scrape PostgreSQL exporter
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'postgresql'
    static_configs:
      - targets: ['postgres-exporter:9187']

  - job_name: 'mysql'
    static_configs:
      - targets: ['mysqld-exporter:9104']

# Alerting rules example
# alerts.yml
groups:
  - name: database_alerts
    rules:
      - alert: HighQueryDuration
        expr: pg_stat_statements_mean_time_seconds > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Query average time exceeds 1 second"

      - alert: LowCacheHitRatio
        expr: pg_stat_database_blks_hit / (pg_stat_database_blks_hit + pg_stat_database_blks_read) < 0.95
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Buffer cache hit ratio below 95%"

      - alert: ConnectionsSaturated
        expr: pg_stat_activity_count / pg_settings_max_connections > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Connection usage above 80%"
```

When selecting monitoring tools, consider: deployment model (SaaS vs. self-hosted), database platform support, integration with existing infrastructure, total cost at your scale, team expertise, and vendor lock-in implications. Sometimes the best tool is the one your team will actually use.
Effective performance diagnosis requires mastery of measurement tools across multiple categories. To consolidate the essential concepts:

- EXPLAIN and EXPLAIN ANALYZE reveal how an individual query executes and where its time goes.
- Statistics extensions (pg_stat_statements, Query Store, Performance Schema) aggregate executions to surface the highest-impact queries.
- Wait statistics identify what sessions are blocked on, pointing directly at resource bottlenecks.
- Real-time views cover active incidents; slow query logs and auto_explain capture problems you weren't watching for.
- Third-party and cloud-native platforms add visualization, alerting, and correlation across databases.
What's Next
With measurement tools in hand, we need to know what we're measuring toward. The final page in this module explores Performance Goals—defining targets, establishing SLAs, setting baselines, and creating the framework for continuous performance improvement.
You now have a comprehensive toolkit for database performance measurement. You understand when to use EXPLAIN, how to leverage query statistics extensions, how to analyze wait events, and when to employ logging versus real-time monitoring versus third-party solutions. Next, we'll learn to define and pursue meaningful performance goals.