A production database executes thousands—often millions—of queries daily. Among this deluge, only a fraction truly need optimization. The challenge isn't optimizing a known slow query; it's systematically discovering which queries are slow in the first place.
Slow query identification transforms performance work from reactive firefighting into proactive engineering. Instead of waiting for users to complain about unresponsive applications, you establish detection systems that surface problematic queries before they impact user experience.
This page establishes systematic approaches for discovering slow queries across different database platforms, using techniques that scale from single-developer applications to enterprise data platforms.
By the end of this page, you will understand how to configure slow query logging, leverage aggregate statistics for workload analysis, apply threshold-based detection strategies, and use heat map approaches to identify optimization opportunities efficiently. You'll be equipped to systematically surface the queries that deserve your attention.
Why is slow query identification challenging? Consider the characteristics of production workloads:
- **Volume:** A busy database might execute 10,000+ queries per second. Manual review is impossible.
- **Variability:** The same query template with different parameter values may perform vastly differently. A user lookup by ID might be instant for user 12345 but slow for user 67890 due to data skew.
- **Context Dependence:** A query that's fast in isolation might be slow during peak hours due to resource contention. Time of day and concurrent workload matter.
- **Hidden Cost:** A cheap query executed millions of times may consume more total resources than an expensive query executed once. Per-execution time isn't the only metric.
- **Intermittent Issues:** Some queries are slow only occasionally—when statistics are stale, when the cache is cold, when locks are held. Point-in-time measurement misses these cases.
In most workloads, 20% of query patterns consume 80% of resources. Effective discovery focuses on finding this vital few—the query patterns that, once optimized, dramatically improve overall system performance. Don't try to optimize everything; optimize what matters.
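To check how concentrated your own workload is, a running total over aggregate statistics shows how many patterns it takes to reach 80% of execution time. A minimal sketch against PostgreSQL's pg_stat_statements view (covered in more depth later on this page; assumes the extension is installed and PostgreSQL 13+ column names):

```sql
-- How few query patterns account for 80% of total execution time?
-- Cumulative share of time, ordered from most to least expensive pattern.
SELECT
    LEFT(query, 60) AS query_pattern,
    total_exec_time,
    ROUND((100.0 * SUM(total_exec_time) OVER (ORDER BY total_exec_time DESC)
           / SUM(total_exec_time) OVER ())::NUMERIC, 1) AS cumulative_pct
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 50;
```

If cumulative_pct crosses 80% within the first handful of rows, those patterns are your vital few.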
Slow query logs are the foundational discovery mechanism—database infrastructure that automatically captures queries exceeding a configured time threshold. Every major database supports this capability, though implementation details vary.
```sql
-- ================================================================
-- Slow Query Log Configuration Across Platforms
-- ================================================================

-- =========================
-- MySQL Slow Query Log
-- =========================

-- Check current slow query log settings
SHOW VARIABLES LIKE '%slow_query%';
SHOW VARIABLES LIKE '%long_query_time%';

-- Enable slow query logging (requires restart or dynamic variables)
SET GLOBAL slow_query_log = 'ON';
SET GLOBAL slow_query_log_file = '/var/log/mysql/slow.log';

-- Set threshold to 1 second (queries >= 1 second are logged)
SET GLOBAL long_query_time = 1.0;

-- Log queries not using indexes (optional, powerful but verbose)
SET GLOBAL log_queries_not_using_indexes = 'ON';
SET GLOBAL log_throttle_queries_not_using_indexes = 60;  -- Max per minute

-- For persistent configuration, add to my.cnf:
/*
[mysqld]
slow_query_log = 1
slow_query_log_file = /var/log/mysql/slow.log
long_query_time = 1.0
log_queries_not_using_indexes = 1
log_throttle_queries_not_using_indexes = 60
*/

-- =========================
-- MySQL: Parse slow query log with mysqldumpslow
-- =========================

/*
Command line:
$ mysqldumpslow -s t -t 10 /var/log/mysql/slow.log

Options:
  -s t    Sort by total time
  -s c    Sort by count (frequency)
  -s at   Sort by average time
  -t 10   Top 10 results

Output normalizes queries (replaces literals with N, S):

Count: 1245  Time=2.50s (3112s)  Lock=0.00s (0s)  Rows=100.0 (124500)
  SELECT * FROM orders WHERE customer_id = N

This reveals:
- Query pattern executed 1245 times
- Average 2.5s per execution
- Total 3112 seconds consumed
- Returns ~100 rows on average
*/

-- =========================
-- PostgreSQL: Log Configuration
-- =========================

-- View current logging configuration
SHOW log_min_duration_statement;
SHOW log_statement;

-- Configure in postgresql.conf:
/*
# Log all statements taking more than 1 second
log_min_duration_statement = 1000   # milliseconds

# Alternative: log all statements for analysis
log_statement = 'all'               # WARNING: Very verbose

# Log duration of all statements (without query text)
log_duration = on

# Enhanced logging with timing
log_line_prefix = '%t [%p]: [%l-1] user=%u,db=%d '

# Include bind parameter values when a statement errors (PG 13+)
log_parameter_max_length_on_error = -1
*/

-- View logged queries in the PostgreSQL log file, or check activity live:
SELECT * FROM pg_stat_activity WHERE state != 'idle';

-- =========================
-- PostgreSQL: auto_explain for automatic EXPLAIN on slow queries
-- =========================

/*
Load module in postgresql.conf:
shared_preload_libraries = 'auto_explain'

Configure:
auto_explain.log_min_duration = '1s'       # Threshold
auto_explain.log_analyze = on              # Include ANALYZE output
auto_explain.log_buffers = on              # Include buffer stats
auto_explain.log_timing = on               # Include timing
auto_explain.log_triggers = on             # Include trigger time
auto_explain.log_verbose = on              # Verbose output
auto_explain.log_nested_statements = on    # Log nested statements

Result: Slow queries automatically get EXPLAIN ANALYZE output in the logs.
Invaluable for production diagnosis without manual intervention.
*/

-- =========================
-- SQL Server: Extended Events for Slow Query Capture
-- =========================

-- Create XE session to capture slow queries
CREATE EVENT SESSION SlowQueryCapture ON SERVER
ADD EVENT sqlserver.sql_statement_completed
(
    ACTION
    (
        sqlserver.sql_text,
        sqlserver.database_name,
        sqlserver.username,
        sqlserver.client_app_name,
        sqlserver.query_hash,
        sqlserver.query_plan_hash
    )
    WHERE
    (
        duration >= 1000000  -- 1 second in microseconds
    )
),
ADD EVENT sqlserver.rpc_completed
(
    ACTION
    (
        sqlserver.sql_text,
        sqlserver.database_name,
        sqlserver.username,
        sqlserver.query_hash
    )
    WHERE
    (
        duration >= 1000000
    )
)
ADD TARGET package0.event_file
(
    SET filename = N'SlowQueries.xel',
        max_file_size = 100  -- MB
)
WITH
(
    MAX_MEMORY = 4096 KB,
    EVENT_RETENTION_MODE = ALLOW_SINGLE_EVENT_LOSS,
    MAX_DISPATCH_LATENCY = 5 SECONDS,
    STARTUP_STATE = ON  -- Auto-start with server
);

-- Start the session
ALTER EVENT SESSION SlowQueryCapture ON SERVER STATE = START;

-- Query captured slow queries
SELECT
    event_data.value('(event/@timestamp)[1]', 'datetime2') AS event_time,
    event_data.value('(event/data[@name="duration"]/value)[1]', 'bigint') / 1000000.0 AS duration_sec,
    event_data.value('(event/data[@name="cpu_time"]/value)[1]', 'bigint') / 1000000.0 AS cpu_sec,
    event_data.value('(event/data[@name="logical_reads"]/value)[1]', 'bigint') AS logical_reads,
    event_data.value('(event/action[@name="database_name"]/value)[1]', 'nvarchar(128)') AS database_name,
    event_data.value('(event/action[@name="sql_text"]/value)[1]', 'nvarchar(max)') AS sql_text
FROM
(
    SELECT CAST(event_data AS XML) AS event_data
    FROM sys.fn_xe_file_target_read_file('SlowQueries*.xel', NULL, NULL, NULL)
) AS events
ORDER BY event_time DESC;
```

Setting thresholds too low generates overwhelming log volume; too high misses important queries. Start with 1-2 seconds for production systems, lower for development. Adjust based on log volume and SLA requirements. A threshold catching 0.1% of queries is typically manageable.
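Before committing to a threshold, it can help to estimate how much it would capture. A rough sketch using PostgreSQL's pg_stat_statements (only an approximation, since the view stores per-pattern averages rather than per-execution durations; the 1000 ms cutoff is an example, not a recommendation):

```sql
-- Roughly gauge a candidate slow-query threshold before enabling logging.
SELECT
    COUNT(*) FILTER (WHERE mean_exec_time >= 1000) AS patterns_over_1s,
    COUNT(*) AS total_patterns,
    ROUND(100.0 * COUNT(*) FILTER (WHERE mean_exec_time >= 1000) / COUNT(*), 2) AS pct_over_1s
FROM pg_stat_statements;
```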
While slow query logs capture individual occurrences, aggregate statistics reveal patterns across many executions. A query taking 50ms isn't necessarily a problem—unless it executes 100,000 times per hour, consuming 83 minutes of total CPU time.
Aggregate analysis identifies query patterns by total resource consumption, not just per-execution cost.
```sql
-- ================================================================
-- Aggregate Statistics Analysis for Query Discovery
-- ================================================================

-- =========================
-- PostgreSQL: pg_stat_statements Analysis
-- =========================

-- Top queries by TOTAL time (most cumulative impact)
SELECT
    calls,
    total_exec_time::NUMERIC(12,2) AS total_ms,
    mean_exec_time::NUMERIC(12,2) AS avg_ms,
    stddev_exec_time::NUMERIC(12,2) AS stddev_ms,
    ROUND((100.0 * total_exec_time / SUM(total_exec_time) OVER())::NUMERIC, 2) AS pct_of_total,
    rows,
    shared_blks_hit + shared_blks_read AS total_blocks,
    LEFT(query, 80) AS query_pattern
FROM pg_stat_statements
WHERE calls > 10  -- Filter noise
ORDER BY total_exec_time DESC
LIMIT 20;

/*
Interpretation:
- Query with 50ms avg but 100K calls = 83 minutes total
- Query with 10s avg but 50 calls = 8 minutes total
- The first query has 10x more impact on the system

Focus on queries with the highest total_exec_time first.
*/

-- Top queries by AVERAGE time (worst individual performance)
SELECT
    calls,
    mean_exec_time::NUMERIC(12,2) AS avg_ms,
    min_exec_time::NUMERIC(12,2) AS min_ms,
    max_exec_time::NUMERIC(12,2) AS max_ms,
    stddev_exec_time::NUMERIC(12,2) AS stddev_ms,
    rows / calls AS avg_rows,
    LEFT(query, 80) AS query_pattern
FROM pg_stat_statements
WHERE calls >= 100  -- Sufficient sample size
ORDER BY mean_exec_time DESC
LIMIT 20;

-- Queries with HIGH VARIABILITY (inconsistent performance)
SELECT
    calls,
    mean_exec_time::NUMERIC(10,2) AS avg_ms,
    stddev_exec_time::NUMERIC(10,2) AS stddev_ms,
    min_exec_time::NUMERIC(10,2) AS min_ms,
    max_exec_time::NUMERIC(10,2) AS max_ms,
    (max_exec_time / NULLIF(min_exec_time, 0.001))::NUMERIC(10,1) AS variance_ratio,
    LEFT(query, 80) AS query_pattern
FROM pg_stat_statements
WHERE calls > 100
  AND mean_exec_time > 10                   -- Non-trivial queries
  AND max_exec_time > mean_exec_time * 10   -- High variability
ORDER BY variance_ratio DESC
LIMIT 20;

/*
High variability indicates:
- Parameter-dependent performance (data skew)
- Intermittent resource contention
- Plan instability (different plans chosen)
- Cache-dependent behavior

These queries often cause unpredictable user experience.
*/

-- =========================
-- MySQL: Performance Schema Digest Analysis
-- =========================

-- Top query patterns by total time
SELECT
    SCHEMA_NAME AS db,
    DIGEST_TEXT,
    COUNT_STAR AS exec_count,
    ROUND(SUM_TIMER_WAIT / 1000000000000, 2) AS total_time_sec,
    ROUND(AVG_TIMER_WAIT / 1000000000, 2) AS avg_time_ms,
    SUM_ROWS_EXAMINED AS total_rows_examined,
    SUM_ROWS_SENT AS total_rows_sent,
    SUM_ROWS_EXAMINED / NULLIF(SUM_ROWS_SENT, 0) AS examine_to_send_ratio,
    ROUND(100.0 * SUM_TIMER_WAIT /
          (SELECT SUM(SUM_TIMER_WAIT)
           FROM performance_schema.events_statements_summary_by_digest), 2) AS pct_of_total
FROM performance_schema.events_statements_summary_by_digest
WHERE SCHEMA_NAME IS NOT NULL
  AND COUNT_STAR > 10
ORDER BY SUM_TIMER_WAIT DESC
LIMIT 20;

-- Queries causing most I/O (rows examined)
SELECT
    DIGEST_TEXT,
    COUNT_STAR AS exec_count,
    SUM_ROWS_EXAMINED AS total_rows,
    SUM_ROWS_EXAMINED / COUNT_STAR AS avg_rows,
    ROUND(SUM_TIMER_WAIT / 1000000000000, 2) AS total_time_sec,
    SUM_CREATED_TMP_TABLES AS temp_tables,
    SUM_CREATED_TMP_DISK_TABLES AS disk_temp_tables,
    SUM_NO_INDEX_USED AS no_index_count
FROM performance_schema.events_statements_summary_by_digest
WHERE SUM_ROWS_EXAMINED > 100000  -- Significant I/O
ORDER BY SUM_ROWS_EXAMINED DESC
LIMIT 20;

-- =========================
-- SQL Server: Query Store Aggregate Analysis
-- =========================

-- Top queries by total CPU consumption
SELECT TOP 20
    q.query_id,
    qt.query_sql_text,
    SUM(rs.count_executions) AS total_executions,
    SUM(rs.avg_cpu_time * rs.count_executions) / 1000000.0 AS total_cpu_sec,
    AVG(rs.avg_cpu_time) / 1000.0 AS avg_cpu_ms,
    SUM(rs.avg_logical_io_reads * rs.count_executions) AS total_logical_reads,
    ROUND(100.0 * SUM(rs.avg_cpu_time * rs.count_executions) /
          (SELECT SUM(rs2.avg_cpu_time * rs2.count_executions)
           FROM sys.query_store_runtime_stats rs2), 2) AS pct_total_cpu
FROM sys.query_store_query q
JOIN sys.query_store_query_text qt ON q.query_text_id = qt.query_text_id
JOIN sys.query_store_plan p ON q.query_id = p.query_id
JOIN sys.query_store_runtime_stats rs ON p.plan_id = rs.plan_id
GROUP BY q.query_id, qt.query_sql_text
ORDER BY total_cpu_sec DESC;

-- Queries with performance regression (recent worse than historical)
SELECT
    q.query_id,
    LEFT(qt.query_sql_text, 100) AS query_text,
    rsi_recent.avg_duration / 1000.0 AS recent_avg_ms,
    rsi_historical.avg_duration / 1000.0 AS historical_avg_ms,
    rsi_recent.avg_duration / NULLIF(rsi_historical.avg_duration, 0) AS regression_factor
FROM sys.query_store_query q
JOIN sys.query_store_query_text qt ON q.query_text_id = qt.query_text_id
JOIN sys.query_store_plan p ON q.query_id = p.query_id
JOIN
(
    SELECT plan_id, AVG(avg_duration) AS avg_duration
    FROM sys.query_store_runtime_stats rs
    JOIN sys.query_store_runtime_stats_interval rsi
        ON rs.runtime_stats_interval_id = rsi.runtime_stats_interval_id
    WHERE rsi.start_time > DATEADD(day, -1, GETUTCDATE())  -- Last day
    GROUP BY plan_id
) rsi_recent ON p.plan_id = rsi_recent.plan_id
JOIN
(
    SELECT plan_id, AVG(avg_duration) AS avg_duration
    FROM sys.query_store_runtime_stats rs
    JOIN sys.query_store_runtime_stats_interval rsi
        ON rs.runtime_stats_interval_id = rsi.runtime_stats_interval_id
    WHERE rsi.start_time BETWEEN DATEADD(day, -30, GETUTCDATE())
                             AND DATEADD(day, -7, GETUTCDATE())  -- 4-week history
    GROUP BY plan_id
) rsi_historical ON p.plan_id = rsi_historical.plan_id
WHERE rsi_recent.avg_duration > rsi_historical.avg_duration * 2  -- 2x regression
ORDER BY regression_factor DESC;
```

Combine total time with optimization potential. A query consuming 20% of CPU with an easy fix (a missing index) is more actionable than one consuming 30% that is already well optimized. Consider: Impact × Optimizability = Priority Score.
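One way to make that Impact × Optimizability idea sortable is to weight each pattern's share of total time by crude "easy win" signals from the MySQL digest table. A sketch only: treating no-index use and on-disk temp tables as optimizability signals, and the weighting below, are illustrative assumptions rather than a tuned formula:

```sql
-- Rough priority score: impact (share of total time) multiplied by a
-- heuristic optimizability bonus for full scans and on-disk temp tables.
SELECT
    LEFT(DIGEST_TEXT, 80) AS query_pattern,
    ROUND(100.0 * SUM_TIMER_WAIT /
          (SELECT SUM(SUM_TIMER_WAIT)
           FROM performance_schema.events_statements_summary_by_digest), 2) AS impact_pct,
    (SUM_NO_INDEX_USED > 0) + (SUM_CREATED_TMP_DISK_TABLES > 0) AS optimizability_signals,
    ROUND(100.0 * SUM_TIMER_WAIT /
          (SELECT SUM(SUM_TIMER_WAIT)
           FROM performance_schema.events_statements_summary_by_digest), 2)
        * (1 + (SUM_NO_INDEX_USED > 0) + (SUM_CREATED_TMP_DISK_TABLES > 0)) AS priority_score
FROM performance_schema.events_statements_summary_by_digest
WHERE COUNT_STAR > 10
ORDER BY priority_score DESC
LIMIT 20;
```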
Average execution time can be misleading. A query averaging 100ms might have a median well under that figure while its slowest executions take several seconds: the average looks acceptable, but 1 in 20 users experiences terrible latency. Percentile-based detection surfaces the outliers that disproportionately impact user experience:
| Metric | Meaning | Use Case |
|---|---|---|
| p50 (Median) | Typical user experience | Baseline performance expectation |
| p90 | Slowest 10% of requests | Identifies common slow cases |
| p95 | Slowest 5% of requests | Standard SLA target for critical paths |
| p99 | Slowest 1% of requests | Tail latency for high-value operations |
| p99.9 | Worst 0.1% (1 in 1000) | Extreme outliers, often hardware/contention |
```sql
-- ================================================================
-- Percentile-Based Slow Query Detection
-- ================================================================

-- =========================
-- PostgreSQL: Percentile Analysis
-- =========================

-- Using pg_stat_statements (no built-in percentiles, use stddev heuristic)
SELECT
    query,
    calls,
    mean_exec_time::NUMERIC(10,2) AS avg_ms,
    stddev_exec_time::NUMERIC(10,2) AS stddev_ms,
    min_exec_time::NUMERIC(10,2) AS min_ms,
    max_exec_time::NUMERIC(10,2) AS max_ms,
    -- Estimated percentiles using a normal distribution assumption
    (mean_exec_time + 1.28 * stddev_exec_time)::NUMERIC(10,2) AS est_p90_ms,
    (mean_exec_time + 1.65 * stddev_exec_time)::NUMERIC(10,2) AS est_p95_ms,
    (mean_exec_time + 2.33 * stddev_exec_time)::NUMERIC(10,2) AS est_p99_ms,
    -- Identify queries with tail latency issues
    CASE
        WHEN max_exec_time > mean_exec_time * 20 THEN 'SEVERE TAIL LATENCY'
        WHEN max_exec_time > mean_exec_time * 10 THEN 'MODERATE TAIL LATENCY'
        WHEN max_exec_time > mean_exec_time * 5  THEN 'SOME TAIL LATENCY'
        ELSE 'CONSISTENT PERFORMANCE'
    END AS tail_latency_status
FROM pg_stat_statements
WHERE calls > 100
ORDER BY (max_exec_time / NULLIF(mean_exec_time, 0.001)) DESC
LIMIT 20;

/*
Note: pg_stat_statements doesn't store histograms.
For true percentiles, use:
- pg_stat_kcache extension
- External log analysis tools
- Application-level instrumentation
*/

-- =========================
-- SQL Server: Query Store Percentile Analysis
-- =========================

-- Aggregate runtime stats; min/max and stdev approximate the spread
-- (Query Store does not expose true per-execution percentiles here)
SELECT
    q.query_id,
    LEFT(qt.query_sql_text, 80) AS query_preview,
    SUM(rs.count_executions) AS executions,
    AVG(rs.avg_duration) / 1000.0 AS avg_ms,
    MIN(rs.min_duration) / 1000.0 AS min_ms,
    MAX(rs.max_duration) / 1000.0 AS max_ms,
    -- Stdev as a proxy for percentile spread
    STDEV(rs.avg_duration) / 1000.0 AS stddev_ms,
    -- Identify tail latency issues
    MAX(rs.max_duration) / NULLIF(AVG(rs.avg_duration), 0) AS max_to_avg_ratio
FROM sys.query_store_query q
JOIN sys.query_store_query_text qt ON q.query_text_id = qt.query_text_id
JOIN sys.query_store_plan p ON q.query_id = p.query_id
JOIN sys.query_store_runtime_stats rs ON p.plan_id = rs.plan_id
GROUP BY q.query_id, qt.query_sql_text
HAVING SUM(rs.count_executions) > 100
ORDER BY max_to_avg_ratio DESC;

-- For actual percentile calculation, use sys.query_store_wait_stats
-- or application-side instrumentation with histograms

-- =========================
-- Creating a Custom Percentile Tracking View (Concept)
-- =========================

/*
For production percentile tracking, consider:

1. HISTOGRAM BUCKETS approach:
   - Define latency buckets: 0-10ms, 10-50ms, 50-100ms, 100-500ms, 500ms+
   - Count queries in each bucket
   - Calculate percentiles from the cumulative distribution

2. SAMPLING approach:
   - Log every Nth query with full timing
   - Build the distribution from samples
   - Trade accuracy for lower overhead

3. APPLICATION INSTRUMENTATION:
   - Capture timing at the application layer
   - Store in a time-series database (Prometheus, InfluxDB)
   - Calculate percentiles across windows

Example bucket approach:
*/

-- SQL Server: Create latency histogram
SELECT
    CASE
        WHEN rs.avg_duration < 10000   THEN '0-10ms'
        WHEN rs.avg_duration < 50000   THEN '10-50ms'
        WHEN rs.avg_duration < 100000  THEN '50-100ms'
        WHEN rs.avg_duration < 500000  THEN '100-500ms'
        WHEN rs.avg_duration < 1000000 THEN '500ms-1s'
        ELSE '> 1s'
    END AS latency_bucket,
    SUM(rs.count_executions) AS query_count,
    ROUND(100.0 * SUM(rs.count_executions) / SUM(SUM(rs.count_executions)) OVER(), 2) AS percentage,
    SUM(SUM(rs.count_executions)) OVER(ORDER BY MIN(rs.avg_duration)) AS cumulative_count
FROM sys.query_store_plan p
JOIN sys.query_store_runtime_stats rs ON p.plan_id = rs.plan_id
GROUP BY
    CASE
        WHEN rs.avg_duration < 10000   THEN '0-10ms'
        WHEN rs.avg_duration < 50000   THEN '10-50ms'
        WHEN rs.avg_duration < 100000  THEN '50-100ms'
        WHEN rs.avg_duration < 500000  THEN '100-500ms'
        WHEN rs.avg_duration < 1000000 THEN '500ms-1s'
        ELSE '> 1s'
    END
ORDER BY MIN(rs.avg_duration);

/*
Sample output:

| latency_bucket | query_count | percentage | cumulative_count |
|----------------|-------------|------------|------------------|
| 0-10ms         | 850000      | 85.0%      | 850000           |
| 10-50ms        | 100000      | 10.0%      | 950000           |
| 50-100ms       | 30000       | 3.0%       | 980000           |
| 100-500ms      | 15000       | 1.5%       | 995000           |
| 500ms-1s       | 3000        | 0.3%       | 998000           |
| > 1s           | 2000        | 0.2%       | 1000000          |

From this:
- p90 ≈ 10-50ms bucket
- p95 ≈ 50-100ms bucket
- p99 falls in the 100-500ms bucket
*/
```

Choose percentile targets that align with business SLAs. If your SLA promises 95% of requests complete in under 2 seconds, monitor p95 against 2000ms. If user experience degrades significantly after 500ms, set alerting thresholds there regardless of formal SLA.
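If you do capture raw per-execution durations, whether from parsed slow query logs or application instrumentation, ordinary SQL computes exact percentiles. A minimal PostgreSQL sketch, assuming a hypothetical query_timings table with one row per execution (query_pattern, duration_ms):

```sql
-- Exact percentiles from raw per-execution timings (hypothetical table).
SELECT
    query_pattern,
    COUNT(*) AS executions,
    PERCENTILE_CONT(0.50) WITHIN GROUP (ORDER BY duration_ms) AS p50_ms,
    PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY duration_ms) AS p95_ms,
    PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY duration_ms) AS p99_ms
FROM query_timings
GROUP BY query_pattern
HAVING COUNT(*) > 100
ORDER BY p95_ms DESC;
```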
Historical analysis finds patterns over time. Real-time detection catches problems as they happen—alerting you to slow queries before users complain, and enabling investigation while full context is available.
```sql
-- ================================================================
-- Real-Time Slow Query Detection
-- ================================================================

-- =========================
-- PostgreSQL: Currently Slow Queries
-- =========================

-- Find queries running longer than a threshold
SELECT
    pid,
    usename,
    application_name,
    client_addr,
    state,
    wait_event_type,
    wait_event,
    query_start,
    NOW() - query_start AS duration,
    EXTRACT(EPOCH FROM (NOW() - query_start)) AS duration_seconds,
    LEFT(query, 150) AS query_preview,
    CASE
        WHEN NOW() - query_start > INTERVAL '30 seconds' THEN 'CRITICAL'
        WHEN NOW() - query_start > INTERVAL '10 seconds' THEN 'WARNING'
        WHEN NOW() - query_start > INTERVAL '5 seconds'  THEN 'MONITOR'
        ELSE 'NORMAL'
    END AS severity
FROM pg_stat_activity
WHERE state = 'active'
  AND pid != pg_backend_pid()
  AND query NOT LIKE '%pg_stat_activity%'
  AND NOW() - query_start > INTERVAL '1 second'
ORDER BY query_start;

-- Find blocking chains (slow due to locks)
SELECT
    blocked.pid AS blocked_pid,
    blocked.usename AS blocked_user,
    blocked.query AS blocked_query,
    blocking.pid AS blocking_pid,
    blocking.usename AS blocking_user,
    blocking.query AS blocking_query,
    NOW() - blocked.query_start AS blocked_duration
FROM pg_stat_activity blocked
JOIN pg_stat_activity blocking
    ON blocking.pid = ANY(pg_blocking_pids(blocked.pid))
ORDER BY blocked_duration DESC;

-- =========================
-- MySQL: Currently Slow Queries
-- =========================

-- Find long-running queries using the processlist
SELECT
    ID AS process_id,
    USER,
    HOST,
    DB,
    COMMAND,
    TIME AS running_seconds,
    STATE,
    LEFT(INFO, 200) AS query_preview,
    CASE
        WHEN TIME > 30 THEN 'CRITICAL'
        WHEN TIME > 10 THEN 'WARNING'
        WHEN TIME > 5  THEN 'MONITOR'
        ELSE 'NORMAL'
    END AS severity
FROM information_schema.PROCESSLIST
WHERE COMMAND != 'Sleep'
  AND TIME > 1
ORDER BY TIME DESC;

-- Using Performance Schema for more detail
-- (TIMER_WAIT is measured in picoseconds)
SELECT
    thread_id,
    event_name,
    sql_text,
    TIMER_WAIT / 1000000000000 AS running_seconds,
    ROWS_EXAMINED,
    ROWS_SENT,
    CREATED_TMP_TABLES,
    CREATED_TMP_DISK_TABLES,
    NO_INDEX_USED
FROM performance_schema.events_statements_current
WHERE sql_text IS NOT NULL
  AND TIMER_WAIT > 1000000000000  -- > 1 second
ORDER BY TIMER_WAIT DESC;

-- Find queries blocked by locks
-- (MySQL 5.7; in MySQL 8.0 use performance_schema.data_lock_waits)
SELECT
    r.trx_id AS blocked_trx,
    r.trx_mysql_thread_id AS blocked_thread,
    r.trx_query AS blocked_query,
    b.trx_id AS blocking_trx,
    b.trx_mysql_thread_id AS blocking_thread,
    b.trx_query AS blocking_query
FROM information_schema.innodb_lock_waits w
JOIN information_schema.innodb_trx r ON w.requesting_trx_id = r.trx_id
JOIN information_schema.innodb_trx b ON w.blocking_trx_id = b.trx_id;

-- =========================
-- SQL Server: Currently Slow Queries
-- =========================

-- Find long-running queries
SELECT
    r.session_id,
    r.status,
    r.blocking_session_id,
    r.wait_type,
    r.wait_time / 1000 AS wait_seconds,
    r.cpu_time,
    r.total_elapsed_time / 1000 AS elapsed_seconds,
    r.logical_reads,
    r.reads AS physical_reads,
    r.row_count,
    s.login_name,
    s.host_name,
    DB_NAME(r.database_id) AS database_name,
    SUBSTRING(st.text, (r.statement_start_offset/2)+1,
        ((CASE r.statement_end_offset
              WHEN -1 THEN DATALENGTH(st.text)
              ELSE r.statement_end_offset
          END - r.statement_start_offset)/2)+1) AS current_statement,
    CASE
        WHEN r.total_elapsed_time > 30000 THEN 'CRITICAL'
        WHEN r.total_elapsed_time > 10000 THEN 'WARNING'
        WHEN r.total_elapsed_time > 5000  THEN 'MONITOR'
        ELSE 'NORMAL'
    END AS severity
FROM sys.dm_exec_requests r
JOIN sys.dm_exec_sessions s ON r.session_id = s.session_id
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) st
WHERE r.session_id > 50
  AND r.session_id != @@SPID
  AND r.total_elapsed_time > 1000
ORDER BY r.total_elapsed_time DESC;

-- Full blocking chain analysis
;WITH BlockingTree AS
(
    -- Blockers (not blocked by anyone)
    SELECT
        session_id,
        blocking_session_id,
        0 AS level,
        CAST(session_id AS VARCHAR(1000)) AS chain
    FROM sys.dm_exec_requests
    WHERE blocking_session_id = 0
      AND session_id IN (SELECT blocking_session_id
                         FROM sys.dm_exec_requests
                         WHERE blocking_session_id != 0)

    UNION ALL

    -- Blocked sessions
    SELECT
        r.session_id,
        r.blocking_session_id,
        bt.level + 1,
        CAST(bt.chain + ' -> ' + CAST(r.session_id AS VARCHAR(10)) AS VARCHAR(1000))
    FROM sys.dm_exec_requests r
    JOIN BlockingTree bt ON r.blocking_session_id = bt.session_id
)
SELECT
    bt.session_id,
    bt.blocking_session_id,
    bt.level,
    bt.chain AS blocking_chain,
    r.wait_type,
    r.wait_time / 1000 AS wait_seconds,
    st.text AS query_text
FROM BlockingTree bt
JOIN sys.dm_exec_requests r ON bt.session_id = r.session_id
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) st
ORDER BY bt.chain;
```

Wrap these queries in monitoring jobs that run every 30-60 seconds and alert when queries exceed thresholds. Integration with tools like Prometheus, Datadog, or custom scripts enables proactive notification before user impact. The goal: know about slow queries before users report them.
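As a concrete example of such a job, here is a minimal sketch for PostgreSQL using the pg_cron extension (an assumption; any scheduler works, and pg_cron's finest granularity is one minute). It polls pg_stat_activity and records offenders into a hypothetical slow_query_alerts table that an external alerter can watch:

```sql
-- Table to receive alerts from the polling job (hypothetical name).
CREATE TABLE IF NOT EXISTS slow_query_alerts (
    alert_time  TIMESTAMPTZ DEFAULT NOW(),
    pid         INT,
    usename     TEXT,
    duration    INTERVAL,
    query_text  TEXT
);

-- Schedule a once-a-minute check with pg_cron (assumes the extension is installed).
SELECT cron.schedule(
    'slow-query-check',
    '* * * * *',
    $$
    INSERT INTO slow_query_alerts (pid, usename, duration, query_text)
    SELECT pid, usename, NOW() - query_start, LEFT(query, 500)
    FROM pg_stat_activity
    WHERE state = 'active'
      AND pid != pg_backend_pid()
      AND NOW() - query_start > INTERVAL '10 seconds'
    $$
);
```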
Not all slow queries have equal priority. Workload categorization classifies queries by business impact, enabling strategic prioritization of optimization effort.
```sql
-- ================================================================
-- Workload Categorization Strategies
-- ================================================================

-- =========================
-- Categorize by Application Module
-- =========================

-- PostgreSQL: Use application_name for categorization
-- (pg_stat_statements does not record application_name, so this join
--  through pg_stat_activity is only an approximation by connected user)
SELECT
    application_name,
    COUNT(*) AS query_count,
    ROUND(SUM(total_exec_time)::NUMERIC / 1000, 2) AS total_time_sec,
    ROUND(AVG(mean_exec_time)::NUMERIC, 2) AS avg_time_ms,
    ROUND((SUM(total_exec_time) * 100.0 / SUM(SUM(total_exec_time)) OVER())::NUMERIC, 2) AS pct_of_workload
FROM pg_stat_statements pss
JOIN pg_stat_activity psa ON pss.userid = psa.usesysid
WHERE calls > 10
GROUP BY application_name
ORDER BY total_time_sec DESC;

/*
Sample output:

| application_name    | total_time_sec | pct_of_workload |
|---------------------|----------------|-----------------|
| checkout-service    | 45000          | 35%             | ← High priority
| reporting-job       | 35000          | 27%             | ← Can tolerate
| user-api            | 28000          | 22%             | ← High priority
| admin-dashboard     | 12000          | 9%              | ← Low priority
| data-sync           | 9000           | 7%              | ← Background
*/

-- =========================
-- SQL Server: Categorize by Host/Application
-- =========================

SELECT
    s.host_name,
    s.program_name,
    COUNT(*) AS request_count,
    SUM(r.total_elapsed_time) / 1000 AS total_ms,
    AVG(r.total_elapsed_time) / 1000 AS avg_ms,
    SUM(r.logical_reads) AS total_reads
FROM sys.dm_exec_requests r
JOIN sys.dm_exec_sessions s ON r.session_id = s.session_id
WHERE r.session_id > 50
GROUP BY s.host_name, s.program_name
ORDER BY total_ms DESC;

-- =========================
-- Create Priority Classification System
-- =========================

/*
Example priority matrix:

Priority 1 (CRITICAL):
  - Checkout, payment, authentication
  - User-facing, revenue-impacting
  - SLA: p95 < 500ms

Priority 2 (HIGH):
  - Core API endpoints (profile, search, listing)
  - User-facing, experience-impacting
  - SLA: p95 < 1s

Priority 3 (MEDIUM):
  - Admin dashboards, internal tools
  - Employee-facing
  - SLA: p95 < 5s

Priority 4 (LOW):
  - Background jobs, reports, sync
  - No real-time user
  - SLA: Complete within window

Priority 5 (BATCH):
  - ETL, analytics, maintenance
  - Off-hours acceptable
  - SLA: Daily completion
*/

-- SQL Server: Query Store with priority tagging concept
-- (In practice, map query patterns to priorities externally)
SELECT
    qt.query_sql_text,
    SUM(rs.count_executions) AS executions,
    AVG(rs.avg_duration) / 1000.0 AS avg_ms,
    -- Priority heuristic based on query patterns
    CASE
        WHEN qt.query_sql_text LIKE '%checkout%'
             OR qt.query_sql_text LIKE '%payment%'
             OR qt.query_sql_text LIKE '%auth%login%' THEN 'P1-CRITICAL'
        WHEN qt.query_sql_text LIKE '%user%'
             OR qt.query_sql_text LIKE '%profile%'
             OR qt.query_sql_text LIKE '%search%' THEN 'P2-HIGH'
        WHEN qt.query_sql_text LIKE '%admin%'
             OR qt.query_sql_text LIKE '%internal%' THEN 'P3-MEDIUM'
        WHEN qt.query_sql_text LIKE '%report%'
             OR qt.query_sql_text LIKE '%sync%'
             OR qt.query_sql_text LIKE '%job%' THEN 'P4-LOW'
        ELSE 'P5-UNCLASSIFIED'
    END AS priority_tier,
    -- SLA evaluation based on priority
    CASE
        -- P1: Must be < 500ms
        WHEN (qt.query_sql_text LIKE '%checkout%' OR qt.query_sql_text LIKE '%payment%')
             AND AVG(rs.avg_duration) > 500000 THEN 'SLA VIOLATION'
        -- P2: Must be < 1000ms
        WHEN (qt.query_sql_text LIKE '%user%' OR qt.query_sql_text LIKE '%search%')
             AND AVG(rs.avg_duration) > 1000000 THEN 'SLA VIOLATION'
        ELSE 'WITHIN SLA'
    END AS sla_status
FROM sys.query_store_query_text qt
JOIN sys.query_store_query q ON qt.query_text_id = q.query_text_id
JOIN sys.query_store_plan p ON q.query_id = p.query_id
JOIN sys.query_store_runtime_stats rs ON p.plan_id = rs.plan_id
GROUP BY qt.query_sql_text
HAVING SUM(rs.count_executions) > 100
ORDER BY
    CASE
        WHEN qt.query_sql_text LIKE '%checkout%' THEN 1
        WHEN qt.query_sql_text LIKE '%user%' THEN 2
        ELSE 3
    END,
    avg_ms DESC;
```

Manual review of slow query logs and aggregate statistics doesn't scale. Automated discovery pipelines continuously analyze query patterns and surface candidates for optimization.
```sql
-- ================================================================
-- Automated Discovery Pipeline Components
-- ================================================================

-- =========================
-- PostgreSQL: Snapshot Collection Job
-- =========================

-- Create table to store historical snapshots
CREATE TABLE IF NOT EXISTS query_stats_history (
    snapshot_time TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    queryid BIGINT,
    query TEXT,
    calls BIGINT,
    total_exec_time DOUBLE PRECISION,
    mean_exec_time DOUBLE PRECISION,
    rows BIGINT,
    shared_blks_hit BIGINT,
    shared_blks_read BIGINT,
    PRIMARY KEY (snapshot_time, queryid)
);

-- Snapshot step (run hourly via pg_cron or an external scheduler)
INSERT INTO query_stats_history
    (queryid, query, calls, total_exec_time, mean_exec_time,
     rows, shared_blks_hit, shared_blks_read)
SELECT
    queryid, query, calls, total_exec_time, mean_exec_time,
    rows, shared_blks_hit, shared_blks_read
FROM pg_stat_statements
WHERE calls > 10;

-- Detect regressions: queries slower today than the 7-day average
SELECT
    current_stats.query,
    current_stats.mean_exec_time AS current_avg_ms,
    historical_avg.avg_exec_time AS historical_avg_ms,
    current_stats.mean_exec_time / historical_avg.avg_exec_time AS regression_factor
FROM pg_stat_statements current_stats
JOIN
(
    SELECT queryid, AVG(mean_exec_time) AS avg_exec_time
    FROM query_stats_history
    WHERE snapshot_time > NOW() - INTERVAL '7 days'
      AND snapshot_time < NOW() - INTERVAL '1 day'
    GROUP BY queryid
) historical_avg ON current_stats.queryid = historical_avg.queryid
WHERE current_stats.mean_exec_time > historical_avg.avg_exec_time * 2
ORDER BY regression_factor DESC;

-- Detect new query patterns (first seen in the last 24 hours)
SELECT
    pss.queryid,
    pss.query,
    pss.calls,
    pss.mean_exec_time
FROM pg_stat_statements pss
LEFT JOIN query_stats_history h
    ON pss.queryid = h.queryid
   AND h.snapshot_time < NOW() - INTERVAL '1 day'
WHERE h.queryid IS NULL
  AND pss.calls > 100
ORDER BY pss.total_exec_time DESC;

-- =========================
-- SQL Server: Automated Alert Procedure
-- =========================

-- Procedure to check for slow queries and log/alert
-- (assumes dbo.SlowQueryAlerts and dbo.RegressionAlerts logging tables exist)
CREATE PROCEDURE dbo.CheckSlowQueries
AS
BEGIN
    SET NOCOUNT ON;

    -- Find currently long-running queries
    INSERT INTO dbo.SlowQueryAlerts (session_id, query_text, elapsed_ms, cpu_ms, logical_reads, alert_time)
    SELECT
        r.session_id,
        SUBSTRING(st.text, (r.statement_start_offset/2)+1, 4000),
        r.total_elapsed_time,
        r.cpu_time,
        r.logical_reads,
        GETDATE()
    FROM sys.dm_exec_requests r
    CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) st
    WHERE r.total_elapsed_time > 30000  -- 30 seconds
      AND r.session_id > 50;

    -- Check for regression from the Query Store baseline
    INSERT INTO dbo.RegressionAlerts (query_id, current_avg_ms, baseline_avg_ms, regression_factor, alert_time)
    SELECT
        q.query_id,
        current_perf.avg_ms,
        baseline_perf.avg_ms,
        current_perf.avg_ms / baseline_perf.avg_ms,
        GETDATE()
    FROM sys.query_store_query q
    CROSS APPLY
    (
        SELECT AVG(rs.avg_duration) / 1000.0 AS avg_ms
        FROM sys.query_store_plan p
        JOIN sys.query_store_runtime_stats rs ON p.plan_id = rs.plan_id
        JOIN sys.query_store_runtime_stats_interval rsi
            ON rs.runtime_stats_interval_id = rsi.runtime_stats_interval_id
        WHERE p.query_id = q.query_id
          AND rsi.start_time > DATEADD(hour, -24, GETUTCDATE())
    ) current_perf
    CROSS APPLY
    (
        SELECT AVG(rs.avg_duration) / 1000.0 AS avg_ms
        FROM sys.query_store_plan p
        JOIN sys.query_store_runtime_stats rs ON p.plan_id = rs.plan_id
        JOIN sys.query_store_runtime_stats_interval rsi
            ON rs.runtime_stats_interval_id = rsi.runtime_stats_interval_id
        WHERE p.query_id = q.query_id
          AND rsi.start_time BETWEEN DATEADD(day, -7, GETUTCDATE())
                                 AND DATEADD(day, -1, GETUTCDATE())
    ) baseline_perf
    WHERE current_perf.avg_ms > baseline_perf.avg_ms * 2
      AND baseline_perf.avg_ms > 100;  -- Only significant queries
END;
GO

-- Schedule with a SQL Server Agent job to run every 5 minutes

-- =========================
-- Alert Query for Dashboard/Monitoring
-- =========================

-- Summary of optimization candidates (PostgreSQL)
SELECT
    'High Impact' AS category,
    COUNT(*) AS query_count,
    SUM(total_exec_time) / 3600000.0 AS total_hours  -- total_exec_time is in ms
FROM pg_stat_statements
WHERE total_exec_time > (SELECT SUM(total_exec_time) * 0.05 FROM pg_stat_statements)

UNION ALL

SELECT
    'Regression' AS category,
    COUNT(*) AS query_count,
    NULL
FROM
(
    SELECT queryid
    FROM pg_stat_statements pss
    -- ... regression detection logic
) regressions

UNION ALL

SELECT
    'New Patterns' AS category,
    COUNT(*) AS query_count,
    NULL
FROM
(
    SELECT queryid
    FROM pg_stat_statements pss
    -- ... new pattern detection logic
) new_patterns;
```

Systematic slow query identification transforms performance work from reactive troubleshooting into proactive engineering. Multiple detection strategies complement each other to provide comprehensive coverage.
What's next:
Once you've identified slow queries, the next step is fixing them. The next page covers query rewriting—systematic techniques for transforming slow SQL into efficient SQL without changing the result set.
You now possess a comprehensive toolkit for slow query discovery—from basic logging through automated detection pipelines. You can implement these techniques across any database platform and scale from small applications to enterprise workloads.