A production database executes thousands—often millions—of queries daily. Among this deluge, only a fraction truly need optimization. The challenge isn't optimizing a known slow query; it's systematically discovering which queries are slow in the first place.
Slow query identification transforms performance work from reactive firefighting into proactive engineering. Instead of waiting for users to complain about unresponsive applications, you establish detection systems that surface problematic queries before they impact user experience.
This page establishes systematic approaches for discovering slow queries across different database platforms, using techniques that scale from single-developer applications to enterprise data platforms.
By the end of this page, you will understand how to configure slow query logging, leverage aggregate statistics for workload analysis, apply threshold-based detection strategies, and use heat map approaches to identify optimization opportunities efficiently. You'll be equipped to systematically surface the queries that deserve your attention.
Why is slow query identification challenging? Consider the characteristics of production workloads:
- **Volume:** A busy database might execute 10,000+ queries per second. Manual review is impossible.
- **Variability:** The same query template with different parameter values may perform vastly differently. A user lookup by ID might be instant for user 12345 but slow for user 67890 due to data skew.
- **Context Dependence:** A query that's fast in isolation might be slow during peak hours due to resource contention. Time of day and concurrent workload matter.
- **Hidden Cost:** A cheap query executed millions of times may consume more total resources than an expensive query executed once. Per-execution time isn't the only metric.
- **Intermittent Issues:** Some queries are slow only occasionally—when statistics are stale, when the cache is cold, when locks are held. Point-in-time measurement misses these cases.
In most workloads, 20% of query patterns consume 80% of resources. Effective discovery focuses on finding this vital few—the query patterns that, once optimized, dramatically improve overall system performance. Don't try to optimize everything; optimize what matters.
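To check how concentrated your own workload is, a running total over aggregate statistics shows how many patterns it takes to reach 80% of execution time. A minimal sketch against PostgreSQL's pg_stat_statements view (covered in more depth later on this page; assumes the extension is installed and PostgreSQL 13+ column names):

```sql
-- How few query patterns account for 80% of total execution time?
-- Cumulative share of time, ordered from most to least expensive pattern.
SELECT
    LEFT(query, 60) AS query_pattern,
    total_exec_time,
    ROUND((100.0 * SUM(total_exec_time) OVER (ORDER BY total_exec_time DESC)
           / SUM(total_exec_time) OVER ())::NUMERIC, 1) AS cumulative_pct
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 50;
```

If cumulative_pct crosses 80% within the first handful of rows, those patterns are your vital few.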
Slow query logs are the foundational discovery mechanism—database infrastructure that automatically captures queries exceeding a configured time threshold. Every major database supports this capability, though implementation details vary.
```sql
-- ================================================================
-- Slow Query Log Configuration Across Platforms
-- ================================================================

-- =========================
-- MySQL Slow Query Log
-- =========================

-- Check current slow query log settings
SHOW VARIABLES LIKE '%slow_query%';
SHOW VARIABLES LIKE '%long_query_time%';

-- Enable slow query logging (requires restart or dynamic variables)
SET GLOBAL slow_query_log = 'ON';
SET GLOBAL slow_query_log_file = '/var/log/mysql/slow.log';

-- Set threshold to 1 second (queries >= 1 second are logged)
SET GLOBAL long_query_time = 1.0;

-- Log queries not using indexes (optional, powerful but verbose)
SET GLOBAL log_queries_not_using_indexes = 'ON';
SET GLOBAL log_throttle_queries_not_using_indexes = 60;  -- Max per minute

-- For persistent configuration, add to my.cnf:
/*
[mysqld]
slow_query_log = 1
slow_query_log_file = /var/log/mysql/slow.log
long_query_time = 1.0
log_queries_not_using_indexes = 1
log_throttle_queries_not_using_indexes = 60
*/

-- =========================
-- MySQL: Parse slow query log with mysqldumpslow
-- =========================

/*
Command line:
$ mysqldumpslow -s t -t 10 /var/log/mysql/slow.log

Options:
  -s t    Sort by total time
  -s c    Sort by count (frequency)
  -s at   Sort by average time
  -t 10   Top 10 results

Output normalizes queries (replaces literals with N, S):

Count: 1245  Time=2.50s (3112s)  Lock=0.00s (0s)  Rows=100.0 (124500)
  SELECT * FROM orders WHERE customer_id = N

This reveals:
- Query pattern executed 1245 times
- Average 2.5s per execution
- Total 3112 seconds consumed
- Returns ~100 rows on average
*/

-- =========================
-- PostgreSQL: Log Configuration
-- =========================

-- View current logging configuration
SHOW log_min_duration_statement;
SHOW log_statement;

-- Configure in postgresql.conf:
/*
# Log all statements taking more than 1 second
log_min_duration_statement = 1000   # milliseconds

# Alternative: log all statements for analysis
log_statement = 'all'               # WARNING: Very verbose

# Log duration of all statements (without query text)
log_duration = on

# Enhanced logging with timing
log_line_prefix = '%t [%p]: [%l-1] user=%u,db=%d '

# Include bind parameter values when a statement errors (PG 13+)
log_parameter_max_length_on_error = -1
*/

-- View logged queries in the PostgreSQL log file, or check activity live:
SELECT * FROM pg_stat_activity WHERE state != 'idle';

-- =========================
-- PostgreSQL: auto_explain for automatic EXPLAIN on slow queries
-- =========================

/*
Load module in postgresql.conf:
shared_preload_libraries = 'auto_explain'

Configure:
auto_explain.log_min_duration = '1s'       # Threshold
auto_explain.log_analyze = on              # Include ANALYZE output
auto_explain.log_buffers = on              # Include buffer stats
auto_explain.log_timing = on               # Include timing
auto_explain.log_triggers = on             # Include trigger time
auto_explain.log_verbose = on              # Verbose output
auto_explain.log_nested_statements = on    # Log nested statements

Result: Slow queries automatically get EXPLAIN ANALYZE output in the logs.
Invaluable for production diagnosis without manual intervention.
*/

-- =========================
-- SQL Server: Extended Events for Slow Query Capture
-- =========================

-- Create XE session to capture slow queries
CREATE EVENT SESSION SlowQueryCapture ON SERVER
ADD EVENT sqlserver.sql_statement_completed
(
    ACTION
    (
        sqlserver.sql_text,
        sqlserver.database_name,
        sqlserver.username,
        sqlserver.client_app_name,
        sqlserver.query_hash,
        sqlserver.query_plan_hash
    )
    WHERE
    (
        duration >= 1000000  -- 1 second in microseconds
    )
),
ADD EVENT sqlserver.rpc_completed
(
    ACTION
    (
        sqlserver.sql_text,
        sqlserver.database_name,
        sqlserver.username,
        sqlserver.query_hash
    )
    WHERE
    (
        duration >= 1000000
    )
)
ADD TARGET package0.event_file
(
    SET filename = N'SlowQueries.xel',
        max_file_size = 100  -- MB
)
WITH
(
    MAX_MEMORY = 4096 KB,
    EVENT_RETENTION_MODE = ALLOW_SINGLE_EVENT_LOSS,
    MAX_DISPATCH_LATENCY = 5 SECONDS,
    STARTUP_STATE = ON  -- Auto-start with server
);

-- Start the session
ALTER EVENT SESSION SlowQueryCapture ON SERVER STATE = START;

-- Query captured slow queries
SELECT
    event_data.value('(event/@timestamp)[1]', 'datetime2') AS event_time,
    event_data.value('(event/data[@name="duration"]/value)[1]', 'bigint') / 1000000.0 AS duration_sec,
    event_data.value('(event/data[@name="cpu_time"]/value)[1]', 'bigint') / 1000000.0 AS cpu_sec,
    event_data.value('(event/data[@name="logical_reads"]/value)[1]', 'bigint') AS logical_reads,
    event_data.value('(event/action[@name="database_name"]/value)[1]', 'nvarchar(128)') AS database_name,
    event_data.value('(event/action[@name="sql_text"]/value)[1]', 'nvarchar(max)') AS sql_text
FROM
(
    SELECT CAST(event_data AS XML) AS event_data
    FROM sys.fn_xe_file_target_read_file('SlowQueries*.xel', NULL, NULL, NULL)
) AS events
ORDER BY event_time DESC;
```

Setting thresholds too low generates overwhelming log volume; too high misses important queries. Start with 1-2 seconds for production systems, lower for development. Adjust based on log volume and SLA requirements. A threshold catching 0.1% of queries is typically manageable.
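Before committing to a threshold, it can help to estimate how much it would capture. A rough sketch using PostgreSQL's pg_stat_statements (only an approximation, since the view stores per-pattern averages rather than per-execution durations; the 1000 ms cutoff is an example, not a recommendation):

```sql
-- Roughly gauge a candidate slow-query threshold before enabling logging.
SELECT
    COUNT(*) FILTER (WHERE mean_exec_time >= 1000) AS patterns_over_1s,
    COUNT(*) AS total_patterns,
    ROUND(100.0 * COUNT(*) FILTER (WHERE mean_exec_time >= 1000) / COUNT(*), 2) AS pct_over_1s
FROM pg_stat_statements;
```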
While slow query logs capture individual occurrences, aggregate statistics reveal patterns across many executions. A query taking 50ms isn't necessarily a problem—unless it executes 100,000 times per hour, consuming 83 minutes of total CPU time.
Aggregate analysis identifies query patterns by total resource consumption, not just per-execution cost.
```sql
-- ================================================================
-- Aggregate Statistics Analysis for Query Discovery
-- ================================================================

-- =========================
-- PostgreSQL: pg_stat_statements Analysis
-- =========================

-- Top queries by TOTAL time (most cumulative impact)
SELECT
    calls,
    total_exec_time::NUMERIC(12,2) AS total_ms,
    mean_exec_time::NUMERIC(12,2) AS avg_ms,
    stddev_exec_time::NUMERIC(12,2) AS stddev_ms,
    ROUND((100.0 * total_exec_time / SUM(total_exec_time) OVER())::NUMERIC, 2) AS pct_of_total,
    rows,
    shared_blks_hit + shared_blks_read AS total_blocks,
    LEFT(query, 80) AS query_pattern
FROM pg_stat_statements
WHERE calls > 10  -- Filter noise
ORDER BY total_exec_time DESC
LIMIT 20;

/*
Interpretation:
- Query with 50ms avg but 100K calls = 83 minutes total
- Query with 10s avg but 50 calls = 8 minutes total
- The first query has 10x more impact on the system

Focus on queries with the highest total_exec_time first.
*/

-- Top queries by AVERAGE time (worst individual performance)
SELECT
    calls,
    mean_exec_time::NUMERIC(12,2) AS avg_ms,
    min_exec_time::NUMERIC(12,2) AS min_ms,
    max_exec_time::NUMERIC(12,2) AS max_ms,
    stddev_exec_time::NUMERIC(12,2) AS stddev_ms,
    rows / calls AS avg_rows,
    LEFT(query, 80) AS query_pattern
FROM pg_stat_statements
WHERE calls >= 100  -- Sufficient sample size
ORDER BY mean_exec_time DESC
LIMIT 20;

-- Queries with HIGH VARIABILITY (inconsistent performance)
SELECT
    calls,
    mean_exec_time::NUMERIC(10,2) AS avg_ms,
    stddev_exec_time::NUMERIC(10,2) AS stddev_ms,
    min_exec_time::NUMERIC(10,2) AS min_ms,
    max_exec_time::NUMERIC(10,2) AS max_ms,
    (max_exec_time / NULLIF(min_exec_time, 0.001))::NUMERIC(10,1) AS variance_ratio,
    LEFT(query, 80) AS query_pattern
FROM pg_stat_statements
WHERE calls > 100
  AND mean_exec_time > 10                   -- Non-trivial queries
  AND max_exec_time > mean_exec_time * 10   -- High variability
ORDER BY variance_ratio DESC
LIMIT 20;

/*
High variability indicates:
- Parameter-dependent performance (data skew)
- Intermittent resource contention
- Plan instability (different plans chosen)
- Cache-dependent behavior

These queries often cause unpredictable user experience.
*/

-- =========================
-- MySQL: Performance Schema Digest Analysis
-- =========================

-- Top query patterns by total time
SELECT
    SCHEMA_NAME AS db,
    DIGEST_TEXT,
    COUNT_STAR AS exec_count,
    ROUND(SUM_TIMER_WAIT / 1000000000000, 2) AS total_time_sec,
    ROUND(AVG_TIMER_WAIT / 1000000000, 2) AS avg_time_ms,
    SUM_ROWS_EXAMINED AS total_rows_examined,
    SUM_ROWS_SENT AS total_rows_sent,
    SUM_ROWS_EXAMINED / NULLIF(SUM_ROWS_SENT, 0) AS examine_to_send_ratio,
    ROUND(100.0 * SUM_TIMER_WAIT /
          (SELECT SUM(SUM_TIMER_WAIT)
           FROM performance_schema.events_statements_summary_by_digest), 2) AS pct_of_total
FROM performance_schema.events_statements_summary_by_digest
WHERE SCHEMA_NAME IS NOT NULL
  AND COUNT_STAR > 10
ORDER BY SUM_TIMER_WAIT DESC
LIMIT 20;

-- Queries causing most I/O (rows examined)
SELECT
    DIGEST_TEXT,
    COUNT_STAR AS exec_count,
    SUM_ROWS_EXAMINED AS total_rows,
    SUM_ROWS_EXAMINED / COUNT_STAR AS avg_rows,
    ROUND(SUM_TIMER_WAIT / 1000000000000, 2) AS total_time_sec,
    SUM_CREATED_TMP_TABLES AS temp_tables,
    SUM_CREATED_TMP_DISK_TABLES AS disk_temp_tables,
    SUM_NO_INDEX_USED AS no_index_count
FROM performance_schema.events_statements_summary_by_digest
WHERE SUM_ROWS_EXAMINED > 100000  -- Significant I/O
ORDER BY SUM_ROWS_EXAMINED DESC
LIMIT 20;

-- =========================
-- SQL Server: Query Store Aggregate Analysis
-- =========================

-- Top queries by total CPU consumption
SELECT TOP 20
    q.query_id,
    qt.query_sql_text,
    SUM(rs.count_executions) AS total_executions,
    SUM(rs.avg_cpu_time * rs.count_executions) / 1000000.0 AS total_cpu_sec,
    AVG(rs.avg_cpu_time) / 1000.0 AS avg_cpu_ms,
    SUM(rs.avg_logical_io_reads * rs.count_executions) AS total_logical_reads,
    ROUND(100.0 * SUM(rs.avg_cpu_time * rs.count_executions) /
          (SELECT SUM(rs2.avg_cpu_time * rs2.count_executions)
           FROM sys.query_store_runtime_stats rs2), 2) AS pct_total_cpu
FROM sys.query_store_query q
JOIN sys.query_store_query_text qt ON q.query_text_id = qt.query_text_id
JOIN sys.query_store_plan p ON q.query_id = p.query_id
JOIN sys.query_store_runtime_stats rs ON p.plan_id = rs.plan_id
GROUP BY q.query_id, qt.query_sql_text
ORDER BY total_cpu_sec DESC;

-- Queries with performance regression (recent worse than historical)
SELECT
    q.query_id,
    LEFT(qt.query_sql_text, 100) AS query_text,
    rsi_recent.avg_duration / 1000.0 AS recent_avg_ms,
    rsi_historical.avg_duration / 1000.0 AS historical_avg_ms,
    rsi_recent.avg_duration / NULLIF(rsi_historical.avg_duration, 0) AS regression_factor
FROM sys.query_store_query q
JOIN sys.query_store_query_text qt ON q.query_text_id = qt.query_text_id
JOIN sys.query_store_plan p ON q.query_id = p.query_id
JOIN
(
    SELECT plan_id, AVG(avg_duration) AS avg_duration
    FROM sys.query_store_runtime_stats rs
    JOIN sys.query_store_runtime_stats_interval rsi
        ON rs.runtime_stats_interval_id = rsi.runtime_stats_interval_id
    WHERE rsi.start_time > DATEADD(day, -1, GETUTCDATE())  -- Last day
    GROUP BY plan_id
) rsi_recent ON p.plan_id = rsi_recent.plan_id
JOIN
(
    SELECT plan_id, AVG(avg_duration) AS avg_duration
    FROM sys.query_store_runtime_stats rs
    JOIN sys.query_store_runtime_stats_interval rsi
        ON rs.runtime_stats_interval_id = rsi.runtime_stats_interval_id
    WHERE rsi.start_time BETWEEN DATEADD(day, -30, GETUTCDATE())
                             AND DATEADD(day, -7, GETUTCDATE())  -- 4-week history
    GROUP BY plan_id
) rsi_historical ON p.plan_id = rsi_historical.plan_id
WHERE rsi_recent.avg_duration > rsi_historical.avg_duration * 2  -- 2x regression
ORDER BY regression_factor DESC;
```

Combine total time with optimization potential. A query consuming 20% of CPU with an easy fix (a missing index) is more actionable than one consuming 30% that is already well optimized. Consider: Impact × Optimizability = Priority Score.
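One way to make that Impact × Optimizability idea sortable is to weight each pattern's share of total time by crude "easy win" signals from the MySQL digest table. A sketch only: treating no-index use and on-disk temp tables as optimizability signals, and the weighting below, are illustrative assumptions rather than a tuned formula:

```sql
-- Rough priority score: impact (share of total time) multiplied by a
-- heuristic optimizability bonus for full scans and on-disk temp tables.
SELECT
    LEFT(DIGEST_TEXT, 80) AS query_pattern,
    ROUND(100.0 * SUM_TIMER_WAIT /
          (SELECT SUM(SUM_TIMER_WAIT)
           FROM performance_schema.events_statements_summary_by_digest), 2) AS impact_pct,
    (SUM_NO_INDEX_USED > 0) + (SUM_CREATED_TMP_DISK_TABLES > 0) AS optimizability_signals,
    ROUND(100.0 * SUM_TIMER_WAIT /
          (SELECT SUM(SUM_TIMER_WAIT)
           FROM performance_schema.events_statements_summary_by_digest), 2)
        * (1 + (SUM_NO_INDEX_USED > 0) + (SUM_CREATED_TMP_DISK_TABLES > 0)) AS priority_score
FROM performance_schema.events_statements_summary_by_digest
WHERE COUNT_STAR > 10
ORDER BY priority_score DESC
LIMIT 20;
```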
Average execution time can be misleading. A query averaging 100ms might have a median well under that figure while its slowest executions take several seconds: the average looks acceptable, but 1 in 20 users experiences terrible latency. Percentile-based detection surfaces the outliers that disproportionately impact user experience:
| Metric | Meaning | Use Case |
|---|---|---|
| p50 (Median) | Typical user experience | Baseline performance expectation |
| p90 | Slowest 10% of requests | Identifies common slow cases |
| p95 | Slowest 5% of requests | Standard SLA target for critical paths |
| p99 | Slowest 1% of requests | Tail latency for high-value operations |
| p99.9 | Worst 0.1% (1 in 1000) | Extreme outliers, often hardware/contention |
```sql
-- ================================================================
-- Percentile-Based Slow Query Detection
-- ================================================================

-- =========================
-- PostgreSQL: Percentile Analysis
-- =========================

-- Using pg_stat_statements (no built-in percentiles, use stddev heuristic)
SELECT
    query,
    calls,
    mean_exec_time::NUMERIC(10,2) AS avg_ms,
    stddev_exec_time::NUMERIC(10,2) AS stddev_ms,
    min_exec_time::NUMERIC(10,2) AS min_ms,
    max_exec_time::NUMERIC(10,2) AS max_ms,
    -- Estimated percentiles using a normal distribution assumption
    (mean_exec_time + 1.28 * stddev_exec_time)::NUMERIC(10,2) AS est_p90_ms,
    (mean_exec_time + 1.65 * stddev_exec_time)::NUMERIC(10,2) AS est_p95_ms,
    (mean_exec_time + 2.33 * stddev_exec_time)::NUMERIC(10,2) AS est_p99_ms,
    -- Identify queries with tail latency issues
    CASE
        WHEN max_exec_time > mean_exec_time * 20 THEN 'SEVERE TAIL LATENCY'
        WHEN max_exec_time > mean_exec_time * 10 THEN 'MODERATE TAIL LATENCY'
        WHEN max_exec_time > mean_exec_time * 5  THEN 'SOME TAIL LATENCY'
        ELSE 'CONSISTENT PERFORMANCE'
    END AS tail_latency_status
FROM pg_stat_statements
WHERE calls > 100
ORDER BY (max_exec_time / NULLIF(mean_exec_time, 0.001)) DESC
LIMIT 20;

/*
Note: pg_stat_statements doesn't store histograms.
For true percentiles, use:
- pg_stat_kcache extension
- External log analysis tools
- Application-level instrumentation
*/

-- =========================
-- SQL Server: Query Store Percentile Analysis
-- =========================

-- Aggregate runtime stats; min/max and stdev approximate the spread
-- (Query Store does not expose true per-execution percentiles here)
SELECT
    q.query_id,
    LEFT(qt.query_sql_text, 80) AS query_preview,
    SUM(rs.count_executions) AS executions,
    AVG(rs.avg_duration) / 1000.0 AS avg_ms,
    MIN(rs.min_duration) / 1000.0 AS min_ms,
    MAX(rs.max_duration) / 1000.0 AS max_ms,
    -- Stdev as a proxy for percentile spread
    STDEV(rs.avg_duration) / 1000.0 AS stddev_ms,
    -- Identify tail latency issues
    MAX(rs.max_duration) / NULLIF(AVG(rs.avg_duration), 0) AS max_to_avg_ratio
FROM sys.query_store_query q
JOIN sys.query_store_query_text qt ON q.query_text_id = qt.query_text_id
JOIN sys.query_store_plan p ON q.query_id = p.query_id
JOIN sys.query_store_runtime_stats rs ON p.plan_id = rs.plan_id
GROUP BY q.query_id, qt.query_sql_text
HAVING SUM(rs.count_executions) > 100
ORDER BY max_to_avg_ratio DESC;

-- For actual percentile calculation, use sys.query_store_wait_stats
-- or application-side instrumentation with histograms

-- =========================
-- Creating a Custom Percentile Tracking View (Concept)
-- =========================

/*
For production percentile tracking, consider:

1. HISTOGRAM BUCKETS approach:
   - Define latency buckets: 0-10ms, 10-50ms, 50-100ms, 100-500ms, 500ms+
   - Count queries in each bucket
   - Calculate percentiles from the cumulative distribution

2. SAMPLING approach:
   - Log every Nth query with full timing
   - Build the distribution from samples
   - Trade accuracy for lower overhead

3. APPLICATION INSTRUMENTATION:
   - Capture timing at the application layer
   - Store in a time-series database (Prometheus, InfluxDB)
   - Calculate percentiles across windows

Example bucket approach:
*/

-- SQL Server: Create latency histogram
SELECT
    CASE
        WHEN rs.avg_duration < 10000   THEN '0-10ms'
        WHEN rs.avg_duration < 50000   THEN '10-50ms'
        WHEN rs.avg_duration < 100000  THEN '50-100ms'
        WHEN rs.avg_duration < 500000  THEN '100-500ms'
        WHEN rs.avg_duration < 1000000 THEN '500ms-1s'
        ELSE '> 1s'
    END AS latency_bucket,
    SUM(rs.count_executions) AS query_count,
    ROUND(100.0 * SUM(rs.count_executions) / SUM(SUM(rs.count_executions)) OVER(), 2) AS percentage,
    SUM(SUM(rs.count_executions)) OVER(ORDER BY MIN(rs.avg_duration)) AS cumulative_count
FROM sys.query_store_plan p
JOIN sys.query_store_runtime_stats rs ON p.plan_id = rs.plan_id
GROUP BY
    CASE
        WHEN rs.avg_duration < 10000   THEN '0-10ms'
        WHEN rs.avg_duration < 50000   THEN '10-50ms'
        WHEN rs.avg_duration < 100000  THEN '50-100ms'
        WHEN rs.avg_duration < 500000  THEN '100-500ms'
        WHEN rs.avg_duration < 1000000 THEN '500ms-1s'
        ELSE '> 1s'
    END
ORDER BY MIN(rs.avg_duration);

/*
Sample output:

| latency_bucket | query_count | percentage | cumulative_count |
|----------------|-------------|------------|------------------|
| 0-10ms         | 850000      | 85.0%      | 850000           |
| 10-50ms        | 100000      | 10.0%      | 950000           |
| 50-100ms       | 30000       | 3.0%       | 980000           |
| 100-500ms      | 15000       | 1.5%       | 995000           |
| 500ms-1s       | 3000        | 0.3%       | 998000           |
| > 1s           | 2000        | 0.2%       | 1000000          |

From this:
- p90 ≈ 10-50ms bucket
- p95 ≈ 50-100ms bucket
- p99 falls in the 100-500ms bucket
*/
```

Choose percentile targets that align with business SLAs. If your SLA promises 95% of requests complete in under 2 seconds, monitor p95 against 2000ms. If user experience degrades significantly after 500ms, set alerting thresholds there regardless of formal SLA.
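If you do capture raw per-execution durations, whether from parsed slow query logs or application instrumentation, ordinary SQL computes exact percentiles. A minimal PostgreSQL sketch, assuming a hypothetical query_timings table with one row per execution (query_pattern, duration_ms):

```sql
-- Exact percentiles from raw per-execution timings (hypothetical table).
SELECT
    query_pattern,
    COUNT(*) AS executions,
    PERCENTILE_CONT(0.50) WITHIN GROUP (ORDER BY duration_ms) AS p50_ms,
    PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY duration_ms) AS p95_ms,
    PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY duration_ms) AS p99_ms
FROM query_timings
GROUP BY query_pattern
HAVING COUNT(*) > 100
ORDER BY p95_ms DESC;
```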
Historical analysis finds patterns over time. Real-time detection catches problems as they happen—alerting you to slow queries before users complain, and enabling investigation while full context is available.
```sql
-- ================================================================
-- Real-Time Slow Query Detection
-- ================================================================

-- =========================
-- PostgreSQL: Currently Slow Queries
-- =========================

-- Find queries running longer than a threshold
SELECT
    pid,
    usename,
    application_name,
    client_addr,
    state,
    wait_event_type,
    wait_event,
    query_start,
    NOW() - query_start AS duration,
    EXTRACT(EPOCH FROM (NOW() - query_start)) AS duration_seconds,
    LEFT(query, 150) AS query_preview,
    CASE
        WHEN NOW() - query_start > INTERVAL '30 seconds' THEN 'CRITICAL'
        WHEN NOW() - query_start > INTERVAL '10 seconds' THEN 'WARNING'
        WHEN NOW() - query_start > INTERVAL '5 seconds'  THEN 'MONITOR'
        ELSE 'NORMAL'
    END AS severity
FROM pg_stat_activity
WHERE state = 'active'
  AND pid != pg_backend_pid()
  AND query NOT LIKE '%pg_stat_activity%'
  AND NOW() - query_start > INTERVAL '1 second'
ORDER BY query_start;

-- Find blocking chains (slow due to locks)
SELECT
    blocked.pid AS blocked_pid,
    blocked.usename AS blocked_user,
    blocked.query AS blocked_query,
    blocking.pid AS blocking_pid,
    blocking.usename AS blocking_user,
    blocking.query AS blocking_query,
    NOW() - blocked.query_start AS blocked_duration
FROM pg_stat_activity blocked
JOIN pg_stat_activity blocking
    ON blocking.pid = ANY(pg_blocking_pids(blocked.pid))
ORDER BY blocked_duration DESC;

-- =========================
-- MySQL: Currently Slow Queries
-- =========================

-- Find long-running queries using the processlist
SELECT
    ID AS process_id,
    USER,
    HOST,
    DB,
    COMMAND,
    TIME AS running_seconds,
    STATE,
    LEFT(INFO, 200) AS query_preview,
    CASE
        WHEN TIME > 30 THEN 'CRITICAL'
        WHEN TIME > 10 THEN 'WARNING'
        WHEN TIME > 5  THEN 'MONITOR'
        ELSE 'NORMAL'
    END AS severity
FROM information_schema.PROCESSLIST
WHERE COMMAND != 'Sleep'
  AND TIME > 1
ORDER BY TIME DESC;

-- Using Performance Schema for more detail
-- (TIMER_WAIT is measured in picoseconds)
SELECT
    thread_id,
    event_name,
    sql_text,
    TIMER_WAIT / 1000000000000 AS running_seconds,
    ROWS_EXAMINED,
    ROWS_SENT,
    CREATED_TMP_TABLES,
    CREATED_TMP_DISK_TABLES,
    NO_INDEX_USED
FROM performance_schema.events_statements_current
WHERE sql_text IS NOT NULL
  AND TIMER_WAIT > 1000000000000  -- > 1 second
ORDER BY TIMER_WAIT DESC;

-- Find queries blocked by locks
-- (MySQL 5.7; in MySQL 8.0 use performance_schema.data_lock_waits)
SELECT
    r.trx_id AS blocked_trx,
    r.trx_mysql_thread_id AS blocked_thread,
    r.trx_query AS blocked_query,
    b.trx_id AS blocking_trx,
    b.trx_mysql_thread_id AS blocking_thread,
    b.trx_query AS blocking_query
FROM information_schema.innodb_lock_waits w
JOIN information_schema.innodb_trx r ON w.requesting_trx_id = r.trx_id
JOIN information_schema.innodb_trx b ON w.blocking_trx_id = b.trx_id;

-- =========================
-- SQL Server: Currently Slow Queries
-- =========================

-- Find long-running queries
SELECT
    r.session_id,
    r.status,
    r.blocking_session_id,
    r.wait_type,
    r.wait_time / 1000 AS wait_seconds,
    r.cpu_time,
    r.total_elapsed_time / 1000 AS elapsed_seconds,
    r.logical_reads,
    r.reads AS physical_reads,
    r.row_count,
    s.login_name,
    s.host_name,
    DB_NAME(r.database_id) AS database_name,
    SUBSTRING(st.text, (r.statement_start_offset/2)+1,
        ((CASE r.statement_end_offset
              WHEN -1 THEN DATALENGTH(st.text)
              ELSE r.statement_end_offset
          END - r.statement_start_offset)/2)+1) AS current_statement,
    CASE
        WHEN r.total_elapsed_time > 30000 THEN 'CRITICAL'
        WHEN r.total_elapsed_time > 10000 THEN 'WARNING'
        WHEN r.total_elapsed_time > 5000  THEN 'MONITOR'
        ELSE 'NORMAL'
    END AS severity
FROM sys.dm_exec_requests r
JOIN sys.dm_exec_sessions s ON r.session_id = s.session_id
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) st
WHERE r.session_id > 50
  AND r.session_id != @@SPID
  AND r.total_elapsed_time > 1000
ORDER BY r.total_elapsed_time DESC;

-- Full blocking chain analysis
;WITH BlockingTree AS
(
    -- Blockers (not blocked by anyone)
    SELECT
        session_id,
        blocking_session_id,
        0 AS level,
        CAST(session_id AS VARCHAR(1000)) AS chain
    FROM sys.dm_exec_requests
    WHERE blocking_session_id = 0
      AND session_id IN (SELECT blocking_session_id
                         FROM sys.dm_exec_requests
                         WHERE blocking_session_id != 0)

    UNION ALL

    -- Blocked sessions
    SELECT
        r.session_id,
        r.blocking_session_id,
        bt.level + 1,
        CAST(bt.chain + ' -> ' + CAST(r.session_id AS VARCHAR(10)) AS VARCHAR(1000))
    FROM sys.dm_exec_requests r
    JOIN BlockingTree bt ON r.blocking_session_id = bt.session_id
)
SELECT
    bt.session_id,
    bt.blocking_session_id,
    bt.level,
    bt.chain AS blocking_chain,
    r.wait_type,
    r.wait_time / 1000 AS wait_seconds,
    st.text AS query_text
FROM BlockingTree bt
JOIN sys.dm_exec_requests r ON bt.session_id = r.session_id
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) st
ORDER BY bt.chain;
```

Wrap these queries in monitoring jobs that run every 30-60 seconds and alert when queries exceed thresholds. Integration with tools like Prometheus, Datadog, or custom scripts enables proactive notification before user impact. The goal: know about slow queries before users report them.
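As a concrete example of such a job, here is a minimal sketch for PostgreSQL using the pg_cron extension (an assumption; any scheduler works, and pg_cron's finest granularity is one minute). It polls pg_stat_activity and records offenders into a hypothetical slow_query_alerts table that an external alerter can watch:

```sql
-- Table to receive alerts from the polling job (hypothetical name).
CREATE TABLE IF NOT EXISTS slow_query_alerts (
    alert_time  TIMESTAMPTZ DEFAULT NOW(),
    pid         INT,
    usename     TEXT,
    duration    INTERVAL,
    query_text  TEXT
);

-- Schedule a once-a-minute check with pg_cron (assumes the extension is installed).
SELECT cron.schedule(
    'slow-query-check',
    '* * * * *',
    $$
    INSERT INTO slow_query_alerts (pid, usename, duration, query_text)
    SELECT pid, usename, NOW() - query_start, LEFT(query, 500)
    FROM pg_stat_activity
    WHERE state = 'active'
      AND pid != pg_backend_pid()
      AND NOW() - query_start > INTERVAL '10 seconds'
    $$
);
```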
Not all slow queries have equal priority. Workload categorization classifies queries by business impact, enabling strategic prioritization of optimization effort.
```sql
-- ================================================================
-- Workload Categorization Strategies
-- ================================================================

-- =========================
-- Categorize by Application Module
-- =========================

-- PostgreSQL: Use application_name for categorization
-- (pg_stat_statements does not record application_name, so this join
--  through pg_stat_activity is only an approximation by connected user)
SELECT
    application_name,
    COUNT(*) AS query_count,
    ROUND(SUM(total_exec_time)::NUMERIC / 1000, 2) AS total_time_sec,
    ROUND(AVG(mean_exec_time)::NUMERIC, 2) AS avg_time_ms,
    ROUND((SUM(total_exec_time) * 100.0 / SUM(SUM(total_exec_time)) OVER())::NUMERIC, 2) AS pct_of_workload
FROM pg_stat_statements pss
JOIN pg_stat_activity psa ON pss.userid = psa.usesysid
WHERE calls > 10
GROUP BY application_name
ORDER BY total_time_sec DESC;

/*
Sample output:

| application_name    | total_time_sec | pct_of_workload |
|---------------------|----------------|-----------------|
| checkout-service    | 45000          | 35%             | ← High priority
| reporting-job       | 35000          | 27%             | ← Can tolerate
| user-api            | 28000          | 22%             | ← High priority
| admin-dashboard     | 12000          | 9%              | ← Low priority
| data-sync           | 9000           | 7%              | ← Background
*/

-- =========================
-- SQL Server: Categorize by Host/Application
-- =========================

SELECT
    s.host_name,
    s.program_name,
    COUNT(*) AS request_count,
    SUM(r.total_elapsed_time) / 1000 AS total_ms,
    AVG(r.total_elapsed_time) / 1000 AS avg_ms,
    SUM(r.logical_reads) AS total_reads
FROM sys.dm_exec_requests r
JOIN sys.dm_exec_sessions s ON r.session_id = s.session_id
WHERE r.session_id > 50
GROUP BY s.host_name, s.program_name
ORDER BY total_ms DESC;

-- =========================
-- Create Priority Classification System
-- =========================

/*
Example priority matrix:

Priority 1 (CRITICAL):
  - Checkout, payment, authentication
  - User-facing, revenue-impacting
  - SLA: p95 < 500ms

Priority 2 (HIGH):
  - Core API endpoints (profile, search, listing)
  - User-facing, experience-impacting
  - SLA: p95 < 1s

Priority 3 (MEDIUM):
  - Admin dashboards, internal tools
  - Employee-facing
  - SLA: p95 < 5s

Priority 4 (LOW):
  - Background jobs, reports, sync
  - No real-time user
  - SLA: Complete within window

Priority 5 (BATCH):
  - ETL, analytics, maintenance
  - Off-hours acceptable
  - SLA: Daily completion
*/

-- SQL Server: Query Store with priority tagging concept
-- (In practice, map query patterns to priorities externally)
SELECT
    qt.query_sql_text,
    SUM(rs.count_executions) AS executions,
    AVG(rs.avg_duration) / 1000.0 AS avg_ms,
    -- Priority heuristic based on query patterns
    CASE
        WHEN qt.query_sql_text LIKE '%checkout%'
             OR qt.query_sql_text LIKE '%payment%'
             OR qt.query_sql_text LIKE '%auth%login%' THEN 'P1-CRITICAL'
        WHEN qt.query_sql_text LIKE '%user%'
             OR qt.query_sql_text LIKE '%profile%'
             OR qt.query_sql_text LIKE '%search%' THEN 'P2-HIGH'
        WHEN qt.query_sql_text LIKE '%admin%'
             OR qt.query_sql_text LIKE '%internal%' THEN 'P3-MEDIUM'
        WHEN qt.query_sql_text LIKE '%report%'
             OR qt.query_sql_text LIKE '%sync%'
             OR qt.query_sql_text LIKE '%job%' THEN 'P4-LOW'
        ELSE 'P5-UNCLASSIFIED'
    END AS priority_tier,
    -- SLA evaluation based on priority
    CASE
        -- P1: Must be < 500ms
        WHEN (qt.query_sql_text LIKE '%checkout%' OR qt.query_sql_text LIKE '%payment%')
             AND AVG(rs.avg_duration) > 500000 THEN 'SLA VIOLATION'
        -- P2: Must be < 1000ms
        WHEN (qt.query_sql_text LIKE '%user%' OR qt.query_sql_text LIKE '%search%')
             AND AVG(rs.avg_duration) > 1000000 THEN 'SLA VIOLATION'
        ELSE 'WITHIN SLA'
    END AS sla_status
FROM sys.query_store_query_text qt
JOIN sys.query_store_query q ON qt.query_text_id = q.query_text_id
JOIN sys.query_store_plan p ON q.query_id = p.query_id
JOIN sys.query_store_runtime_stats rs ON p.plan_id = rs.plan_id
GROUP BY qt.query_sql_text
HAVING SUM(rs.count_executions) > 100
ORDER BY
    CASE
        WHEN qt.query_sql_text LIKE '%checkout%' THEN 1
        WHEN qt.query_sql_text LIKE '%user%' THEN 2
        ELSE 3
    END,
    avg_ms DESC;
```

Manual review of slow query logs and aggregate statistics doesn't scale. Automated discovery pipelines continuously analyze query patterns and surface candidates for optimization.
```sql
-- ================================================================
-- Automated Discovery Pipeline Components
-- ================================================================

-- =========================
-- PostgreSQL: Snapshot Collection Job
-- =========================

-- Create table to store historical snapshots
CREATE TABLE IF NOT EXISTS query_stats_history (
    snapshot_time TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    queryid BIGINT,
    query TEXT,
    calls BIGINT,
    total_exec_time DOUBLE PRECISION,
    mean_exec_time DOUBLE PRECISION,
    rows BIGINT,
    shared_blks_hit BIGINT,
    shared_blks_read BIGINT,
    PRIMARY KEY (snapshot_time, queryid)
);

-- Snapshot step (run hourly via pg_cron or an external scheduler)
INSERT INTO query_stats_history
    (queryid, query, calls, total_exec_time, mean_exec_time,
     rows, shared_blks_hit, shared_blks_read)
SELECT
    queryid, query, calls, total_exec_time, mean_exec_time,
    rows, shared_blks_hit, shared_blks_read
FROM pg_stat_statements
WHERE calls > 10;

-- Detect regressions: queries slower today than the 7-day average
SELECT
    current_stats.query,
    current_stats.mean_exec_time AS current_avg_ms,
    historical_avg.avg_exec_time AS historical_avg_ms,
    current_stats.mean_exec_time / historical_avg.avg_exec_time AS regression_factor
FROM pg_stat_statements current_stats
JOIN
(
    SELECT queryid, AVG(mean_exec_time) AS avg_exec_time
    FROM query_stats_history
    WHERE snapshot_time > NOW() - INTERVAL '7 days'
      AND snapshot_time < NOW() - INTERVAL '1 day'
    GROUP BY queryid
) historical_avg ON current_stats.queryid = historical_avg.queryid
WHERE current_stats.mean_exec_time > historical_avg.avg_exec_time * 2
ORDER BY regression_factor DESC;

-- Detect new query patterns (first seen in the last 24 hours)
SELECT
    pss.queryid,
    pss.query,
    pss.calls,
    pss.mean_exec_time
FROM pg_stat_statements pss
LEFT JOIN query_stats_history h
    ON pss.queryid = h.queryid
   AND h.snapshot_time < NOW() - INTERVAL '1 day'
WHERE h.queryid IS NULL
  AND pss.calls > 100
ORDER BY pss.total_exec_time DESC;

-- =========================
-- SQL Server: Automated Alert Procedure
-- =========================

-- Procedure to check for slow queries and log/alert
-- (assumes dbo.SlowQueryAlerts and dbo.RegressionAlerts logging tables exist)
CREATE PROCEDURE dbo.CheckSlowQueries
AS
BEGIN
    SET NOCOUNT ON;

    -- Find currently long-running queries
    INSERT INTO dbo.SlowQueryAlerts (session_id, query_text, elapsed_ms, cpu_ms, logical_reads, alert_time)
    SELECT
        r.session_id,
        SUBSTRING(st.text, (r.statement_start_offset/2)+1, 4000),
        r.total_elapsed_time,
        r.cpu_time,
        r.logical_reads,
        GETDATE()
    FROM sys.dm_exec_requests r
    CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) st
    WHERE r.total_elapsed_time > 30000  -- 30 seconds
      AND r.session_id > 50;

    -- Check for regression from the Query Store baseline
    INSERT INTO dbo.RegressionAlerts (query_id, current_avg_ms, baseline_avg_ms, regression_factor, alert_time)
    SELECT
        q.query_id,
        current_perf.avg_ms,
        baseline_perf.avg_ms,
        current_perf.avg_ms / baseline_perf.avg_ms,
        GETDATE()
    FROM sys.query_store_query q
    CROSS APPLY
    (
        SELECT AVG(rs.avg_duration) / 1000.0 AS avg_ms
        FROM sys.query_store_plan p
        JOIN sys.query_store_runtime_stats rs ON p.plan_id = rs.plan_id
        JOIN sys.query_store_runtime_stats_interval rsi
            ON rs.runtime_stats_interval_id = rsi.runtime_stats_interval_id
        WHERE p.query_id = q.query_id
          AND rsi.start_time > DATEADD(hour, -24, GETUTCDATE())
    ) current_perf
    CROSS APPLY
    (
        SELECT AVG(rs.avg_duration) / 1000.0 AS avg_ms
        FROM sys.query_store_plan p
        JOIN sys.query_store_runtime_stats rs ON p.plan_id = rs.plan_id
        JOIN sys.query_store_runtime_stats_interval rsi
            ON rs.runtime_stats_interval_id = rsi.runtime_stats_interval_id
        WHERE p.query_id = q.query_id
          AND rsi.start_time BETWEEN DATEADD(day, -7, GETUTCDATE())
                                 AND DATEADD(day, -1, GETUTCDATE())
    ) baseline_perf
    WHERE current_perf.avg_ms > baseline_perf.avg_ms * 2
      AND baseline_perf.avg_ms > 100;  -- Only significant queries
END;
GO

-- Schedule with a SQL Server Agent job to run every 5 minutes

-- =========================
-- Alert Query for Dashboard/Monitoring
-- =========================

-- Summary of optimization candidates (PostgreSQL)
SELECT
    'High Impact' AS category,
    COUNT(*) AS query_count,
    SUM(total_exec_time) / 3600000.0 AS total_hours  -- total_exec_time is in ms
FROM pg_stat_statements
WHERE total_exec_time > (SELECT SUM(total_exec_time) * 0.05 FROM pg_stat_statements)

UNION ALL

SELECT
    'Regression' AS category,
    COUNT(*) AS query_count,
    NULL
FROM
(
    SELECT queryid
    FROM pg_stat_statements pss
    -- ... regression detection logic
) regressions

UNION ALL

SELECT
    'New Patterns' AS category,
    COUNT(*) AS query_count,
    NULL
FROM
(
    SELECT queryid
    FROM pg_stat_statements pss
    -- ... new pattern detection logic
) new_patterns;
```

Systematic slow query identification transforms performance work from reactive troubleshooting into proactive engineering. Multiple detection strategies complement each other to provide comprehensive coverage.
What's next:
Once you've identified slow queries, the next step is fixing them. The next page covers query rewriting—systematic techniques for transforming slow SQL into efficient SQL without changing the result set.
You now possess a comprehensive toolkit for slow query discovery—from basic logging through automated detection pipelines. You can implement these techniques across any database platform and scale from small applications to enterprise workloads.